linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Ext4: Uninitialized Block Groups
@ 2007-09-19  0:25 Avantika Mathur
  2007-09-19  3:03 ` Andrew Morton
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Avantika Mathur @ 2007-09-19  0:25 UTC (permalink / raw)
  To: linux-kernel, linux-ext4; +Cc: Andreas Dilger, cmm

[-- Attachment #1: Type: text/plain, Size: 23611 bytes --]

from: Andreas Dilger <adilger@clusterfs.com>

In pass1 of e2fsck, every inode table in the fileystem is scanned and checked, 
regardless of whether it is in use.  This is this the most time consuming part 
of the filesystem check.  The unintialized block group feature can greatly 
reduce e2fsck time by eliminating checking of uninitialized inodes.  

With this feature, there is a a high water mark of used inodes for each block 
group.  Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time.  A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.

The feature is enabled through a mkfs option

	mke2fs /dev/ -O uninit_groups

A patch adding support for uninitialized block groups to e2fsprogs tools has 
been posted to the linux-ext4 mailing list.

The patches have been stress tested with fsstress and fsx.  In performance 
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows 
linearly with the total number of inodes in the filesytem.  In ext4 with the 
uninitialized block groups feature, the e2fsck time is constant, based 
solely on the number of used inodes rather than the total inode count.  
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users.  With performance improvement of 2-20 
times, depending on how full the filesystem is.

The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.  

In each group descriptor if we have

EXT4_BG_INODE_UNINIT set in bg_flags:
        Inode table is not initialized/used in this group. So we can skip
        the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
        No block in the group is used. So we can skip the block bitmap
        verification for this group.

We also add two new fields to group descriptor as a part of
uninitialized group patch.

        __le16  bg_itable_unused;       /* Unused inodes count */
        __le16  bg_checksum;            /* crc16(sb_uuid+group+desc) */


bg_itable_unused:

If we have EXT4_BG_INODE_UNINIT not set in bg_flags
then bg_itable_unused will give the offset within
the inode table till the inodes are used. This can be
used by fsck to skip list of inodes that are marked unused.


bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure group descriptor
is not corrupt. We add checksum to group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.


Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

---
 fs/ext4/balloc.c        |   94 +++++++++++++++++++++++++++++++++-
 fs/ext4/group.h         |   29 ++++++++++
 fs/ext4/ialloc.c        |  132 ++++++++++++++++++++++++++++++++++++++++++++----
 fs/ext4/resize.c        |    2 
 fs/ext4/super.c         |   96 ++++++++++++++++++++++++++++++++++
 include/linux/ext4_fs.h |   14 ++++-
 6 files changed, 352 insertions(+), 15 deletions(-)

Index: linux-2.6.23-rc6/fs/ext4/balloc.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/balloc.c	2007-09-10 19:50:29.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/balloc.c	2007-09-18 11:38:24.000000000 -0700
@@ -20,6 +20,7 @@
 #include <linux/quotaops.h>
 #include <linux/buffer_head.h>
 
+#include "group.h"
 /*
  * balloc.c contains the blocks allocation and deallocation routines
  */
@@ -42,6 +43,74 @@
 
 }
 
+/* Initializes an uninitialized block bitmap if given, and returns the
+ * number of blocks free in the group. */
+unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
+				int block_group, struct ext4_group_desc *gdp)
+{
+	unsigned long start;
+	int bit, bit_max;
+	unsigned free_blocks;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	if (bh) {
+		J_ASSERT_BH(bh, buffer_locked(bh));
+
+		/* If checksum is bad mark all blocks used to prevent allocation
+		 * essentially implementing a per-group read-only flag. */
+		if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
+			ext4_error(sb, __FUNCTION__,
+				   "Checksum bad for group %u\n", block_group);
+			gdp->bg_free_blocks_count = 0;
+			gdp->bg_free_inodes_count = 0;
+			gdp->bg_itable_unused = 0;
+			memset(bh->b_data, 0xff, sb->s_blocksize);
+			return 0;
+		}
+		memset(bh->b_data, 0, sb->s_blocksize);
+	}
+
+	/* Check for superblock and gdt backups in this group */
+	bit_max = ext4_bg_has_super(sb, block_group);
+
+	if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_META_BG) ||
+	    block_group < le32_to_cpu(sbi->s_es->s_first_meta_bg) *
+			  sbi->s_desc_per_block) {
+		if (bit_max) {
+			bit_max += ext4_bg_num_gdb(sb, block_group);
+			bit_max +=
+				le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks);
+		}
+	} else { /* For META_BG_BLOCK_GROUPS */
+		int group_rel = (block_group -
+				 le32_to_cpu(sbi->s_es->s_first_meta_bg)) %
+				EXT4_DESC_PER_BLOCK(sb);
+		if (group_rel == 0 || group_rel == 1 ||
+		    (group_rel == EXT4_DESC_PER_BLOCK(sb) - 1))
+			bit_max += 1;
+	}
+
+	/* Last and first groups are always initialized */
+	free_blocks = EXT4_BLOCKS_PER_GROUP(sb) - bit_max;
+
+	if (bh) {
+		for (bit = 0; bit < bit_max; bit++)
+			ext4_set_bit(bit, bh->b_data);
+
+		start = block_group * EXT4_BLOCKS_PER_GROUP(sb) +
+			le32_to_cpu(sbi->s_es->s_first_data_block);
+
+		/* Set bits for block and inode bitmaps, and inode table */
+		ext4_set_bit(ext4_block_bitmap(sb, gdp) - start, bh->b_data);
+		ext4_set_bit(ext4_inode_bitmap(sb, gdp) - start, bh->b_data);
+		for (bit = le32_to_cpu(gdp->bg_inode_table) - start,
+		     bit_max = bit + sbi->s_itb_per_group; bit < bit_max; bit++)
+			ext4_set_bit(bit, bh->b_data);
+	}
+
+	return free_blocks - sbi->s_itb_per_group - 2;
+}
+
 /*
  * The free blocks are managed by bitmaps.  A file system contains several
  * blocks groups.  Each group contains 1 bitmap block for blocks, 1 bitmap
@@ -110,16 +179,29 @@
  *
  * Return buffer_head on success or NULL in case of failure.
  */
-static struct buffer_head *
+struct buffer_head *
 read_block_bitmap(struct super_block *sb, unsigned int block_group)
 {
 	struct ext4_group_desc * desc;
 	struct buffer_head * bh = NULL;
 
-	desc = ext4_get_group_desc (sb, block_group, NULL);
+	desc = ext4_get_group_desc(sb, block_group, NULL);
 	if (!desc)
 		goto error_out;
-	bh = sb_bread(sb, ext4_block_bitmap(sb, desc));
+	if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+		bh = sb_getblk(sb, ext4_block_bitmap(sb, desc));
+		if (!buffer_uptodate(bh)) {
+			lock_buffer(bh);
+			if (!buffer_uptodate(bh)) {
+				ext4_init_block_bitmap(sb, bh, block_group,
+						       desc);
+				set_buffer_uptodate(bh);
+			}
+			unlock_buffer(bh);
+		}
+	} else {
+		bh = sb_bread(sb, ext4_block_bitmap(sb,desc));
+	}
 	if (!bh)
 		ext4_error (sb, "read_block_bitmap",
 			    "Cannot read block bitmap - "
@@ -586,6 +668,7 @@
 	desc->bg_free_blocks_count =
 		cpu_to_le16(le16_to_cpu(desc->bg_free_blocks_count) +
 			group_freed);
+	desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
 	spin_unlock(sb_bgl_lock(sbi, block_group));
 	percpu_counter_mod(&sbi->s_freeblocks_counter, count);
 
@@ -1644,8 +1727,11 @@
 			ret_block, goal_hits, goal_attempts);
 
 	spin_lock(sb_bgl_lock(sbi, group_no));
+	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
+		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, group_no, gdp);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
 	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
 
Index: linux-2.6.23-rc6/fs/ext4/group.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.23-rc6/fs/ext4/group.h	2007-09-18 11:38:24.000000000 -0700
@@ -0,0 +1,29 @@
+/*
+ *  linux/fs/ext4/group.h
+ *
+ * Copyright (C) 2007 Cluster File Systems, Inc
+ *
+ * Author: Andreas Dilger <adilger@clusterfs.com>
+ */
+
+#ifndef _LINUX_EXT4_GROUP_H
+#define _LINUX_EXT4_GROUP_H
+#if defined(CONFIG_CRC16)
+#include <linux/crc16.h>
+#endif
+
+extern __le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 group,
+				   struct ext4_group_desc *gdp);
+extern int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 group,
+				       struct ext4_group_desc *gdp);
+struct buffer_head *read_block_bitmap(struct super_block *sb,
+				      unsigned int block_group);
+extern unsigned ext4_init_block_bitmap(struct super_block *sb,
+				       struct buffer_head *bh, int group,
+				       struct ext4_group_desc *desc);
+#define ext4_free_blocks_after_init(sb, group, desc)			\
+		ext4_init_block_bitmap(sb, NULL, group, desc)
+extern unsigned ext4_init_inode_bitmap(struct super_block *sb,
+				       struct buffer_head *bh, int group,
+				       struct ext4_group_desc *desc);
+#endif /* _LINUX_EXT4_GROUP_H */
Index: linux-2.6.23-rc6/fs/ext4/ialloc.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/ialloc.c	2007-09-10 19:50:29.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/ialloc.c	2007-09-18 11:38:24.000000000 -0700
@@ -28,6 +28,7 @@
 
 #include "xattr.h"
 #include "acl.h"
+#include "group.h"
 
 /*
  * ialloc.c contains the inodes allocation and deallocation routines
@@ -43,6 +44,52 @@
  * the free blocks count in the block.
  */
 
+/*
+ * To avoid calling the atomic setbit hundreds or thousands of times, we only
+ * need to use it within a single byte (to ensure we get endianness right).
+ * We can use memset for the rest of the bitmap as there are no other users.
+ */
+static void mark_bitmap_end(int start_bit, int end_bit, char *bitmap)
+{
+	int i;
+
+	if (start_bit >= end_bit)
+		return;
+
+	ext4_debug("mark end bits +%d through +%d used\n", start_bit, end_bit);
+	for (i = start_bit; i < ((start_bit + 7) & ~7UL); i++)
+		ext4_set_bit(i, bitmap);
+	if (i < end_bit)
+		memset(bitmap + (i >> 3), 0xff, (end_bit - i) >> 3);
+}
+
+/* Initializes an uninitialized inode bitmap */
+unsigned ext4_init_inode_bitmap(struct super_block *sb,
+				struct buffer_head *bh, int block_group,
+				struct ext4_group_desc *gdp)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	J_ASSERT_BH(bh, buffer_locked(bh));
+
+	/* If checksum is bad mark all blocks and inodes use to prevent
+	 * allocation, essentially implementing a per-group read-only flag. */
+	if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
+		ext4_error(sb, __FUNCTION__, "Checksum bad for group %u\n",
+			   block_group);
+		gdp->bg_free_blocks_count = 0;
+		gdp->bg_free_inodes_count = 0;
+		gdp->bg_itable_unused = 0;
+		memset(bh->b_data, 0xff, sb->s_blocksize);
+		return 0;
+	}
+
+	memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
+	mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), EXT4_BLOCKS_PER_GROUP(sb),
+			bh->b_data);
+
+	return EXT4_INODES_PER_GROUP(sb);
+}
 
 /*
  * Read the inode allocation bitmap for a given block_group, reading
@@ -59,8 +106,20 @@
 	desc = ext4_get_group_desc(sb, block_group, NULL);
 	if (!desc)
 		goto error_out;
-
-	bh = sb_bread(sb, ext4_inode_bitmap(sb, desc));
+	if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
+		bh = sb_getblk(sb, ext4_inode_bitmap(sb, desc));
+		if (!buffer_uptodate(bh)) {
+			lock_buffer(bh);
+			if (!buffer_uptodate(bh)) {
+				ext4_init_inode_bitmap(sb, bh, block_group,
+						       desc);
+				set_buffer_uptodate(bh);
+			}
+			unlock_buffer(bh);
+		}
+	} else {
+		bh = sb_bread(sb, ext4_inode_bitmap(sb, desc));
+	}
 	if (!bh)
 		ext4_error(sb, "read_inode_bitmap",
 			    "Cannot read inode bitmap - "
@@ -169,6 +228,8 @@
 			if (is_directory)
 				gdp->bg_used_dirs_count = cpu_to_le16(
 				  le16_to_cpu(gdp->bg_used_dirs_count) - 1);
+			gdp->bg_checksum = ext4_group_desc_csum(sbi,
+							block_group, gdp);
 			spin_unlock(sb_bgl_lock(sbi, block_group));
 			percpu_counter_inc(&sbi->s_freeinodes_counter);
 			if (is_directory)
@@ -438,7 +499,7 @@
 	struct ext4_sb_info *sbi;
 	int err = 0;
 	struct inode *ret;
-	int i;
+	int i, free = 0;
 
 	/* Cannot create files in a deleted directory */
 	if (!dir || !dir->i_nlink)
@@ -520,11 +581,13 @@
 	goto out;
 
 got:
-	ino += group * EXT4_INODES_PER_GROUP(sb) + 1;
-	if (ino < EXT4_FIRST_INO(sb) || ino > le32_to_cpu(es->s_inodes_count)) {
-		ext4_error (sb, "ext4_new_inode",
-			    "reserved inode or inode > inodes count - "
-			    "block_group = %d, inode=%lu", group, ino);
+	ino++;
+	if ((group == 0 && ino < EXT4_FIRST_INO(sb)) ||
+	    ino > EXT4_INODES_PER_GROUP(sb)) {
+		ext4_error(sb, __FUNCTION__,
+			   "reserved inode or inode > inodes count - "
+			   "block_group = %d, inode=%lu", group,
+			   ino + group * EXT4_INODES_PER_GROUP(sb));
 		err = -EIO;
 		goto fail;
 	}
@@ -532,13 +595,64 @@
 	BUFFER_TRACE(bh2, "get_write_access");
 	err = ext4_journal_get_write_access(handle, bh2);
 	if (err) goto fail;
+
+	/* We may have to initialize the block bitmap if it isn't already */
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM) &&
+	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+		struct buffer_head *block_bh = read_block_bitmap(sb, group);
+
+		BUFFER_TRACE(block_bh, "get block bitmap access");
+		err = ext4_journal_get_write_access(handle, block_bh);
+		if (err) {
+			brelse(block_bh);
+			goto fail;
+		}
+
+		free = 0;
+		spin_lock(sb_bgl_lock(sbi, group));
+		/* recheck and clear flag under lock if we still need to */
+		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
+			free = ext4_free_blocks_after_init(sb, group, gdp);
+			gdp->bg_free_blocks_count = cpu_to_le16(free);
+		}
+		spin_unlock(sb_bgl_lock(sbi, group));
+
+		/* Don't need to dirty bitmap block if we didn't change it */
+		if (free) {
+			BUFFER_TRACE(block_bh, "dirty block bitmap");
+			err = ext4_journal_dirty_metadata(handle, block_bh);
+		}
+
+		brelse(block_bh);
+		if (err)
+			goto fail;
+	}
+
 	spin_lock(sb_bgl_lock(sbi, group));
+	/* If we didn't allocate from within the initialized part of the inode
+	 * table then we need to initialize up to this inode. */
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
+		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
+			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
+			free = EXT4_INODES_PER_GROUP(sb);
+		} else {
+			free = EXT4_INODES_PER_GROUP(sb) -
+				le16_to_cpu(gdp->bg_itable_unused);
+		}
+
+		if (ino > free)
+			gdp->bg_itable_unused =
+				cpu_to_le16(EXT4_INODES_PER_GROUP(sb) - ino);
+	}
+
 	gdp->bg_free_inodes_count =
 		cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
 	if (S_ISDIR(mode)) {
 		gdp->bg_used_dirs_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1);
 	}
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
 	spin_unlock(sb_bgl_lock(sbi, group));
 	BUFFER_TRACE(bh2, "call ext4_journal_dirty_metadata");
 	err = ext4_journal_dirty_metadata(handle, bh2);
@@ -560,7 +674,7 @@
 		inode->i_gid = current->fsgid;
 	inode->i_mode = mode;
 
-	inode->i_ino = ino;
+	inode->i_ino = ino + group * EXT4_INODES_PER_GROUP(sb);
 	/* This is the optimal IO size (for stat), not the fs block size */
 	inode->i_blocks = 0;
 	inode->i_mtime = inode->i_atime = inode->i_ctime = ei->i_crtime =
Index: linux-2.6.23-rc6/fs/ext4/resize.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/resize.c	2007-09-10 19:50:29.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/resize.c	2007-09-18 11:38:24.000000000 -0700
@@ -16,6 +16,7 @@
 #include <linux/errno.h>
 #include <linux/slab.h>
 
+#include "group.h"
 
 #define outside(b, first, last)	((b) < (first) || (b) >= (last))
 #define inside(b, first, last)	((b) >= (first) && (b) < (last))
@@ -842,6 +843,7 @@
 	ext4_inode_table_set(sb, gdp, input->inode_table); /* LV FIXME */
 	gdp->bg_free_blocks_count = cpu_to_le16(input->free_blocks_count);
 	gdp->bg_free_inodes_count = cpu_to_le16(EXT4_INODES_PER_GROUP(sb));
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, input->group, gdp);
 
 	/*
 	 * Make the new blocks and inodes valid next.  We do this before
Index: linux-2.6.23-rc6/fs/ext4/super.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/super.c	2007-09-18 11:37:28.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/super.c	2007-09-18 11:38:24.000000000 -0700
@@ -43,6 +43,7 @@
 #include "xattr.h"
 #include "acl.h"
 #include "namei.h"
+#include "group.h"
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -1247,6 +1248,93 @@
 	return res;
 }
 
+#if !defined(CONFIG_CRC16)
+/** CRC table for the CRC-16. The poly is 0x8005 (x16 + x15 + x2 + 1) */
+__u16 const crc16_table[256] = {
+	0x0000, 0xC0C1, 0xC181, 0x0140, 0xC301, 0x03C0, 0x0280, 0xC241,
+	0xC601, 0x06C0, 0x0780, 0xC741, 0x0500, 0xC5C1, 0xC481, 0x0440,
+	0xCC01, 0x0CC0, 0x0D80, 0xCD41, 0x0F00, 0xCFC1, 0xCE81, 0x0E40,
+	0x0A00, 0xCAC1, 0xCB81, 0x0B40, 0xC901, 0x09C0, 0x0880, 0xC841,
+	0xD801, 0x18C0, 0x1980, 0xD941, 0x1B00, 0xDBC1, 0xDA81, 0x1A40,
+	0x1E00, 0xDEC1, 0xDF81, 0x1F40, 0xDD01, 0x1DC0, 0x1C80, 0xDC41,
+	0x1400, 0xD4C1, 0xD581, 0x1540, 0xD701, 0x17C0, 0x1680, 0xD641,
+	0xD201, 0x12C0, 0x1380, 0xD341, 0x1100, 0xD1C1, 0xD081, 0x1040,
+	0xF001, 0x30C0, 0x3180, 0xF141, 0x3300, 0xF3C1, 0xF281, 0x3240,
+	0x3600, 0xF6C1, 0xF781, 0x3740, 0xF501, 0x35C0, 0x3480, 0xF441,
+	0x3C00, 0xFCC1, 0xFD81, 0x3D40, 0xFF01, 0x3FC0, 0x3E80, 0xFE41,
+	0xFA01, 0x3AC0, 0x3B80, 0xFB41, 0x3900, 0xF9C1, 0xF881, 0x3840,
+	0x2800, 0xE8C1, 0xE981, 0x2940, 0xEB01, 0x2BC0, 0x2A80, 0xEA41,
+	0xEE01, 0x2EC0, 0x2F80, 0xEF41, 0x2D00, 0xEDC1, 0xEC81, 0x2C40,
+	0xE401, 0x24C0, 0x2580, 0xE541, 0x2700, 0xE7C1, 0xE681, 0x2640,
+	0x2200, 0xE2C1, 0xE381, 0x2340, 0xE101, 0x21C0, 0x2080, 0xE041,
+	0xA001, 0x60C0, 0x6180, 0xA141, 0x6300, 0xA3C1, 0xA281, 0x6240,
+	0x6600, 0xA6C1, 0xA781, 0x6740, 0xA501, 0x65C0, 0x6480, 0xA441,
+	0x6C00, 0xACC1, 0xAD81, 0x6D40, 0xAF01, 0x6FC0, 0x6E80, 0xAE41,
+	0xAA01, 0x6AC0, 0x6B80, 0xAB41, 0x6900, 0xA9C1, 0xA881, 0x6840,
+	0x7800, 0xB8C1, 0xB981, 0x7940, 0xBB01, 0x7BC0, 0x7A80, 0xBA41,
+	0xBE01, 0x7EC0, 0x7F80, 0xBF41, 0x7D00, 0xBDC1, 0xBC81, 0x7C40,
+	0xB401, 0x74C0, 0x7580, 0xB541, 0x7700, 0xB7C1, 0xB681, 0x7640,
+	0x7200, 0xB2C1, 0xB381, 0x7340, 0xB101, 0x71C0, 0x7080, 0xB041,
+	0x5000, 0x90C1, 0x9181, 0x5140, 0x9301, 0x53C0, 0x5280, 0x9241,
+	0x9601, 0x56C0, 0x5780, 0x9741, 0x5500, 0x95C1, 0x9481, 0x5440,
+	0x9C01, 0x5CC0, 0x5D80, 0x9D41, 0x5F00, 0x9FC1, 0x9E81, 0x5E40,
+	0x5A00, 0x9AC1, 0x9B81, 0x5B40, 0x9901, 0x59C0, 0x5880, 0x9841,
+	0x8801, 0x48C0, 0x4980, 0x8941, 0x4B00, 0x8BC1, 0x8A81, 0x4A40,
+	0x4E00, 0x8EC1, 0x8F81, 0x4F40, 0x8D01, 0x4DC0, 0x4C80, 0x8C41,
+	0x4400, 0x84C1, 0x8581, 0x4540, 0x8701, 0x47C0, 0x4680, 0x8641,
+	0x8201, 0x42C0, 0x4380, 0x8341, 0x4100, 0x81C1, 0x8081, 0x4040
+};
+
+static inline __u16 crc16_byte(__u16 crc, const __u8 data)
+{
+	return (crc >> 8) ^ crc16_table[(crc ^ data) & 0xff];
+}
+
+__u16 crc16(__u16 crc, __u8 const *buffer, size_t len)
+{
+	while (len--)
+		crc = crc16_byte(crc, *buffer++);
+	return crc;
+}
+#endif
+
+__le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 block_group,
+			    struct ext4_group_desc *gdp)
+{
+	__u16 crc = 0;
+
+	if (sbi->s_es->s_feature_ro_compat &
+	    cpu_to_le32(EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
+		int offset = offsetof(struct ext4_group_desc, bg_checksum);
+		__le32 le_group = cpu_to_le32(block_group);
+
+		crc = crc16(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
+		crc = crc16(crc, (__u8 *)&le_group, sizeof(le_group));
+		crc = crc16(crc, (__u8 *)gdp, offset);
+		offset += sizeof(gdp->bg_checksum); /* skip checksum */
+		/* for checksum of struct ext4_group_desc do the rest...*/
+		if ((sbi->s_es->s_feature_incompat &
+		     cpu_to_le32(EXT4_FEATURE_INCOMPAT_64BIT)) &&
+		    offset < le16_to_cpu(sbi->s_es->s_desc_size))
+			crc = crc16(crc, (__u8 *)gdp + offset,
+				    le16_to_cpu(sbi->s_es->s_desc_size) -
+					offset);
+	}
+
+	return cpu_to_le16(crc);
+}
+
+int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 block_group,
+				struct ext4_group_desc *gdp)
+{
+	if ((sbi->s_es->s_feature_ro_compat &
+	     cpu_to_le32(EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) &&
+	    (gdp->bg_checksum != ext4_group_desc_csum(sbi, block_group, gdp)))
+		return 0;
+
+	return 1;
+}
+
 /* Called at mount-time, super-block is locked */
 static int ext4_check_descriptors (struct super_block * sb)
 {
@@ -1301,6 +1389,14 @@
 				    i, inode_table);
 			return 0;
 		}
+		if (!ext4_group_desc_csum_verify(sbi, i, gdp)) {
+			ext4_error(sb, __FUNCTION__,
+				   "Checksum for group %d failed (%u!=%u)\n", i,
+				   le16_to_cpu(ext4_group_desc_csum(sbi, i,
+								    gdp)),
+				   le16_to_cpu(gdp->bg_checksum));
+			return 0;
+		}
 		first_block += EXT4_BLOCKS_PER_GROUP(sb);
 		gdp = (struct ext4_group_desc *)
 			((__u8 *)gdp + EXT4_DESC_SIZE(sb));
Index: linux-2.6.23-rc6/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/ext4_fs.h	2007-09-18 11:37:28.000000000 -0700
+++ linux-2.6.23-rc6/include/linux/ext4_fs.h	2007-09-18 11:38:24.000000000 -0700
@@ -123,19 +123,25 @@
  */
 struct ext4_group_desc
 {
-	__le32	bg_block_bitmap;		/* Blocks bitmap block */
-	__le32	bg_inode_bitmap;		/* Inodes bitmap block */
+	__le32	bg_block_bitmap;	/* Blocks bitmap block */
+	__le32	bg_inode_bitmap;	/* Inodes bitmap block */
 	__le32	bg_inode_table;		/* Inodes table block */
 	__le16	bg_free_blocks_count;	/* Free blocks count */
 	__le16	bg_free_inodes_count;	/* Free inodes count */
 	__le16	bg_used_dirs_count;	/* Directories count */
-	__u16	bg_flags;
-	__u32	bg_reserved[3];
+	__u16	bg_flags;		/* EXT4_BG_flags (INODE_UNINIT, etc) */
+	__u32	bg_reserved[2];		/* Likely block/inode bitmap checksum */
+	__le16  bg_itable_unused;	/* Unused inodes count */
+	__le16  bg_checksum;		/* crc16(sb_uuid+group+desc) */
 	__le32	bg_block_bitmap_hi;	/* Blocks bitmap block MSB */
 	__le32	bg_inode_bitmap_hi;	/* Inodes bitmap block MSB */
 	__le32	bg_inode_table_hi;	/* Inodes table block MSB */
 };
 
+#define EXT4_BG_INODE_UNINIT	0x0001 /* Inode table/bitmap not in use */
+#define EXT4_BG_BLOCK_UNINIT	0x0002 /* Block bitmap not in use */
+#define EXT4_BG_INODE_ZEROED	0x0004 /* On-disk itable initialized to zero */
+
 #ifdef __KERNEL__
 #include <linux/ext4_fs_i.h>
 #include <linux/ext4_fs_sb.h>
@@ -693,6 +699,7 @@
 #define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
 #define EXT4_FEATURE_RO_COMPAT_LARGE_FILE	0x0002
 #define EXT4_FEATURE_RO_COMPAT_BTREE_DIR	0x0004
+#define EXT4_FEATURE_RO_COMPAT_GDT_CSUM		0x0010
 #define EXT4_FEATURE_RO_COMPAT_DIR_NLINK	0x0020
 #define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE	0x0040
 
@@ -712,6 +719,7 @@
 					 EXT4_FEATURE_INCOMPAT_64BIT)
 #define EXT4_FEATURE_RO_COMPAT_SUPP	(EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 					 EXT4_FEATURE_RO_COMPAT_LARGE_FILE| \
+					 EXT4_FEATURE_RO_COMPAT_GDT_CSUM| \
 					 EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
 					 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
 					 EXT4_FEATURE_RO_COMPAT_BTREE_DIR)



[-- Attachment #2: e2fsck-uninit.pdf --]
[-- Type: application/pdf, Size: 33539 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Ext4: Uninitialized Block Groups
  2007-09-19  0:25 [PATCH] Ext4: Uninitialized Block Groups Avantika Mathur
@ 2007-09-19  3:03 ` Andrew Morton
  2007-09-19  6:30   ` Andreas Dilger
  2007-09-19  3:05 ` Andrew Morton
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2007-09-19  3:03 UTC (permalink / raw)
  To: Avantika Mathur; +Cc: linux-kernel, linux-ext4, Andreas Dilger, cmm

On Tue, 18 Sep 2007 17:25:31 -0700 Avantika Mathur <mathur@linux.vnet.ibm.com> wrote:

> +#if !defined(CONFIG_CRC16)
> +/** CRC table for the CRC-16. The poly is 0x8005 (x16 + x15 + x2 + 1) */
> +__u16 const crc16_table[256] = {
> +	0x0000, 0xC0C1, 0xC181, 0x0140, 0xC301, 0x03C0, 0x0280, 0xC241,
> +	0xC601, 0x06C0, 0x0780, 0xC741, 0x0500, 0xC5C1, 0xC481, 0x0440,
> +	0xCC01, 0x0CC0, 0x0D80, 0xCD41, 0x0F00, 0xCFC1, 0xCE81, 0x0E40,
> +	0x0A00, 0xCAC1, 0xCB81, 0x0B40, 0xC901, 0x09C0, 0x0880, 0xC841,
> +	0xD801, 0x18C0, 0x1980, 0xD941, 0x1B00, 0xDBC1, 0xDA81, 0x1A40,
> +	0x1E00, 0xDEC1, 0xDF81, 0x1F40, 0xDD01, 0x1DC0, 0x1C80, 0xDC41,

That's rather sad.  A plain old "depends on" would be better.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Ext4: Uninitialized Block Groups
  2007-09-19  0:25 [PATCH] Ext4: Uninitialized Block Groups Avantika Mathur
  2007-09-19  3:03 ` Andrew Morton
@ 2007-09-19  3:05 ` Andrew Morton
  2007-09-19 12:06 ` Valerie Clement
  2007-09-20 23:22 ` Andrew Morton
  3 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2007-09-19  3:05 UTC (permalink / raw)
  To: Avantika Mathur; +Cc: linux-kernel, linux-ext4, Andreas Dilger, cmm

On Tue, 18 Sep 2007 17:25:31 -0700 Avantika Mathur <mathur@linux.vnet.ibm.com> wrote:

> +
> +__u16 crc16(__u16 crc, __u8 const *buffer, size_t len)

And is we really really have to do this, then the ext4-private crc16() 
should have static scope.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Ext4: Uninitialized Block Groups
  2007-09-19  3:03 ` Andrew Morton
@ 2007-09-19  6:30   ` Andreas Dilger
  2007-09-19 22:54     ` Avantika Mathur
  0 siblings, 1 reply; 12+ messages in thread
From: Andreas Dilger @ 2007-09-19  6:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Avantika Mathur, linux-kernel, linux-ext4, cmm

On Sep 18, 2007  20:03 -0700, Andrew Morton wrote:
> On Tue, 18 Sep 2007 17:25:31 -0700 Avantika Mathur <mathur@linux.vnet.ibm.com> wrote:
> 
> > +#if !defined(CONFIG_CRC16)
> > +/** CRC table for the CRC-16. The poly is 0x8005 (x16 + x15 + x2 + 1) */
> > +__u16 const crc16_table[256] = {
> > +	0x0000, 0xC0C1, 0xC181, 0x0140, 0xC301, 0x03C0, 0x0280, 0xC241,
> 
> That's rather sad.  A plain old "depends on" would be better.

My bad.  We wrote this patch and it had to run on older kernels that might
not even have lib/crc16.c (it was added around 2.6.15 or so, so e.g. RHEL4
doesn't have it).  I forgot to remove it from the upstream submission,
and since it didn't cause problems nobody else complained...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Ext4: Uninitialized Block Groups
  2007-09-19 12:06 ` Valerie Clement
@ 2007-09-19 11:34   ` Aneesh Kumar K.V
  2007-09-19 12:01     ` Valerie Clement
  2007-09-19 16:42   ` Andreas Dilger
  1 sibling, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2007-09-19 11:34 UTC (permalink / raw)
  To: Valerie Clement; +Cc: Avantika Mathur, linux-ext4, Andreas Dilger, cmm



Valerie Clement wrote:
> Hi Avantika,
> I ran some tests with the uninit_groups feature enabled and got error messages
> when running e2fsck on my ext4 partition. e2fsck complains of an "invalid 
> unused inodes count" in some group descriptors.
> These errors occur when checking groups which have only one inode in use. The 
> "free inodes" count has been decremented by one in these groups but not the 
> "unused inodes" count.
> 
> The following patch fixes the problem.
> 
> Andreas, could you check if my patch is correct?
> Thanks a lot,
>   Valérie
> 
> Index: linux-2.6.23-rc6/fs/ext4/ialloc.c
> ===================================================================
> --- linux-2.6.23-rc6.orig/fs/ext4/ialloc.c	2007-09-19 11:31:01.000000000 +0200
> +++ linux-2.6.23-rc6/fs/ext4/ialloc.c	2007-09-19 11:31:41.000000000 +0200
> @@ -633,13 +633,10 @@ got:
>  	/* If we didn't allocate from within the initialized part of the inode
>  	 * table then we need to initialize up to this inode. */
>  	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
> -		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
> +		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))
>  			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
> -			free = EXT4_INODES_PER_GROUP(sb);
> -		} else {
> -			free = EXT4_INODES_PER_GROUP(sb) -
> +		free = EXT4_INODES_PER_GROUP(sb) -
>  				le16_to_cpu(gdp->bg_itable_unused);
> -		}
> 
>

the variable free is confusingly named here. It is not the free inode count.
rather it indicate the last used relative inode number in the group. How about
the below ?

-aneesh

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 4250c02..cfe2e09 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -633,17 +633,21 @@ got:
 	/* If we didn't allocate from within the initialized part of the inode
 	 * table then we need to initialize up to this inode. */
 	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
-		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
+		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))
 			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
-			free = EXT4_INODES_PER_GROUP(sb);
-		} else {
-			free = EXT4_INODES_PER_GROUP(sb) -
-				le16_to_cpu(gdp->bg_itable_unused);
-		}
 
-		if (ino > free)
+		/*
+		 * Check the relative inode number against the last used
+		 * relative inode number in this group. if it is greater
+		 * we need to  update the bg_itable_unused count
+		 *
+		 */
+		if (ino > (EXT4_INODES_PER_GROUP(sb) -
+					le16_to_cpu(gdp->bg_itable_unused))) {
+
 			gdp->bg_itable_unused =
 				cpu_to_le16(EXT4_INODES_PER_GROUP(sb) - ino);
+		}
 	}
 
 	gdp->bg_free_inodes_count =

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] Ext4: Uninitialized Block Groups
  2007-09-19 11:34   ` Aneesh Kumar K.V
@ 2007-09-19 12:01     ` Valerie Clement
  0 siblings, 0 replies; 12+ messages in thread
From: Valerie Clement @ 2007-09-19 12:01 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Avantika Mathur, linux-ext4, Andreas Dilger, cmm

Aneesh Kumar K.V wrote:
> the variable free is confusingly named here. It is not the free inode 
> count.
> rather it indicate the last used relative inode number in the group. How 
> about
> the below ?
> 
> -aneesh

I agree, it's better.
   Valérie

> 
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 4250c02..cfe2e09 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -633,17 +633,21 @@ got:
>     /* If we didn't allocate from within the initialized part of the inode
>      * table then we need to initialize up to this inode. */
>     if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
> -        if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
> +        if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))
>             gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
> -            free = EXT4_INODES_PER_GROUP(sb);
> -        } else {
> -            free = EXT4_INODES_PER_GROUP(sb) -
> -                le16_to_cpu(gdp->bg_itable_unused);
> -        }
> 
> -        if (ino > free)
> +        /*
> +         * Check the relative inode number against the last used
> +         * relative inode number in this group. if it is greater
> +         * we need to  update the bg_itable_unused count
> +         *
> +         */
> +        if (ino > (EXT4_INODES_PER_GROUP(sb) -
> +                    le16_to_cpu(gdp->bg_itable_unused))) {
> +
>             gdp->bg_itable_unused =
>                 cpu_to_le16(EXT4_INODES_PER_GROUP(sb) - ino);
> +        }
>     }
> 
>     gdp->bg_free_inodes_count =
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Ext4: Uninitialized Block Groups
  2007-09-19  0:25 [PATCH] Ext4: Uninitialized Block Groups Avantika Mathur
  2007-09-19  3:03 ` Andrew Morton
  2007-09-19  3:05 ` Andrew Morton
@ 2007-09-19 12:06 ` Valerie Clement
  2007-09-19 11:34   ` Aneesh Kumar K.V
  2007-09-19 16:42   ` Andreas Dilger
  2007-09-20 23:22 ` Andrew Morton
  3 siblings, 2 replies; 12+ messages in thread
From: Valerie Clement @ 2007-09-19 12:06 UTC (permalink / raw)
  To: Avantika Mathur; +Cc: linux-ext4, Andreas Dilger, cmm

Hi Avantika,
I ran some tests with the uninit_groups feature enabled and got error messages
when running e2fsck on my ext4 partition. e2fsck complains of an "invalid 
unused inodes count" in some group descriptors.
These errors occur when checking groups which have only one inode in use. The 
"free inodes" count has been decremented by one in these groups but not the 
"unused inodes" count.

The following patch fixes the problem.

Andreas, could you check if my patch is correct?
Thanks a lot,
  Valérie

Index: linux-2.6.23-rc6/fs/ext4/ialloc.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/ialloc.c	2007-09-19 11:31:01.000000000 +0200
+++ linux-2.6.23-rc6/fs/ext4/ialloc.c	2007-09-19 11:31:41.000000000 +0200
@@ -633,13 +633,10 @@ got:
 	/* If we didn't allocate from within the initialized part of the inode
 	 * table then we need to initialize up to this inode. */
 	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
-		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
+		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))
 			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
-			free = EXT4_INODES_PER_GROUP(sb);
-		} else {
-			free = EXT4_INODES_PER_GROUP(sb) -
+		free = EXT4_INODES_PER_GROUP(sb) -
 				le16_to_cpu(gdp->bg_itable_unused);
-		}
 
 		if (ino > free)
 			gdp->bg_itable_unused =

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Ext4: Uninitialized Block Groups
  2007-09-19 12:06 ` Valerie Clement
  2007-09-19 11:34   ` Aneesh Kumar K.V
@ 2007-09-19 16:42   ` Andreas Dilger
  2007-09-19 19:19     ` Avantika Mathur
  1 sibling, 1 reply; 12+ messages in thread
From: Andreas Dilger @ 2007-09-19 16:42 UTC (permalink / raw)
  To: Valerie Clement; +Cc: Avantika Mathur, linux-ext4, cmm, kalpak, girish

On Sep 19, 2007  14:06 +0200, Valerie Clement wrote:
> I ran some tests with the uninit_groups feature enabled and got error messages
> when running e2fsck on my ext4 partition. e2fsck complains of an "invalid 
> unused inodes count" in some group descriptors.
> These errors occur when checking groups which have only one inode in use. The 
> "free inodes" count has been decremented by one in these groups but not the 
> "unused inodes" count.
> 
> Index: linux-2.6.23-rc6/fs/ext4/ialloc.c
> ===================================================================
> --- linux-2.6.23-rc6.orig/fs/ext4/ialloc.c	2007-09-19 11:31:01.000000000 +0200
> +++ linux-2.6.23-rc6/fs/ext4/ialloc.c	2007-09-19 11:31:41.000000000 +0200
> @@ -633,13 +633,10 @@ got:
>  	/* If we didn't allocate from within the initialized part of the inode
>  	 * table then we need to initialize up to this inode. */
>  	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
> -		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
> +		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))
>  			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
> -			free = EXT4_INODES_PER_GROUP(sb);
> -		} else {
> -			free = EXT4_INODES_PER_GROUP(sb) -
> +		free = EXT4_INODES_PER_GROUP(sb) -
>  				le16_to_cpu(gdp->bg_itable_unused);
> -		}

Hmm, this is indeed incorrect, but I'm not sure solution is the right one.
I guess in our testing we ran it for a long time and must have created
more than a single inode per group...

What about the following instead?  I think the assumption in the original
code is that "it's a new group, all the inodes are free", but that is not
correct - we want to make NONE of the inodes free initially so that
bt_itable_unused is recalculated below:

		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
 			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
-			free = EXT4_INODES_PER_GROUP(sb);
+			free = 0;
		} else {
			free = EXT4_INODES_PER_GROUP(sb) -
 				le16_to_cpu(gdp->bg_itable_unused);
		}

		 if (ino > free)
                       gdp->bg_itable_unused =
                                cpu_to_le16(EXT3_INODES_PER_GROUP(sb) - ino);

Still a bit uneasy about off-by-one errors here though.  Is "ino" 0 or
1 for the first inode in the group.  We might need to have a -1 in the
bg_itable_unused calculation still.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Ext4: Uninitialized Block Groups
  2007-09-19 16:42   ` Andreas Dilger
@ 2007-09-19 19:19     ` Avantika Mathur
  0 siblings, 0 replies; 12+ messages in thread
From: Avantika Mathur @ 2007-09-19 19:19 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Valerie Clement, linux-ext4, cmm, kalpak, girish

Andreas Dilger wrote:
> On Sep 19, 2007  14:06 +0200, Valerie Clement wrote:
>   
>> I ran some tests with the uninit_groups feature enabled and got error messages
>> when running e2fsck on my ext4 partition. e2fsck complains of an "invalid 
>> unused inodes count" in some group descriptors.
>> These errors occur when checking groups which have only one inode in use. The 
>> "free inodes" count has been decremented by one in these groups but not the 
>> "unused inodes" count.
>>
>> Index: linux-2.6.23-rc6/fs/ext4/ialloc.c
>> ===================================================================
>> --- linux-2.6.23-rc6.orig/fs/ext4/ialloc.c	2007-09-19 11:31:01.000000000 +0200
>> +++ linux-2.6.23-rc6/fs/ext4/ialloc.c	2007-09-19 11:31:41.000000000 +0200
>> @@ -633,13 +633,10 @@ got:
>>  	/* If we didn't allocate from within the initialized part of the inode
>>  	 * table then we need to initialize up to this inode. */
>>  	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
>> -		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
>> +		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))
>>  			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
>> -			free = EXT4_INODES_PER_GROUP(sb);
>> -		} else {
>> -			free = EXT4_INODES_PER_GROUP(sb) -
>> +		free = EXT4_INODES_PER_GROUP(sb) -
>>  				le16_to_cpu(gdp->bg_itable_unused);
>> -		}
>>     
>
> Hmm, this is indeed incorrect, but I'm not sure solution is the right one.
> I guess in our testing we ran it for a long time and must have created
> more than a single inode per group...
>
> What about the following instead?  I think the assumption in the original
> code is that "it's a new group, all the inodes are free", but that is not
> correct - we want to make NONE of the inodes free initially so that
> bt_itable_unused is recalculated below:
>
> 		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
>  			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
> -			free = EXT4_INODES_PER_GROUP(sb);
> +			free = 0;
> 		} else {
> 			free = EXT4_INODES_PER_GROUP(sb) -
>  				le16_to_cpu(gdp->bg_itable_unused);
> 		}
>
> 		 if (ino > free)
>                        gdp->bg_itable_unused =
>                                 cpu_to_le16(EXT3_INODES_PER_GROUP(sb) - ino);
>
> Still a bit uneasy about off-by-one errors here though.  Is "ino" 0 or
> 1 for the first inode in the group.  We might need to have a -1 in the
> bg_itable_unused calculation still.
>   
ino is incremented right before this code, so the first inode in the 
group is represented by 1, not 0. So this fix looks good.
Aneesh has incorporated this fix and also removed the local crc16 code, 
I will be reposting his new patch to lkml.

thanks
Avantika

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Ext4: Uninitialized Block Groups
  2007-09-19  6:30   ` Andreas Dilger
@ 2007-09-19 22:54     ` Avantika Mathur
  0 siblings, 0 replies; 12+ messages in thread
From: Avantika Mathur @ 2007-09-19 22:54 UTC (permalink / raw)
  To: Andrew Morton, Avantika Mathur, linux-kernel, linux-ext4, cmm,
	"Aneesh Kumar K.V" <

Andreas Dilger wrote:
> On Sep 18, 2007  20:03 -0700, Andrew Morton wrote:
>   
>> On Tue, 18 Sep 2007 17:25:31 -0700 Avantika Mathur <mathur@linux.vnet.ibm.com> wrote:
>>
>>     
>>> +#if !defined(CONFIG_CRC16)
>>> +/** CRC table for the CRC-16. The poly is 0x8005 (x16 + x15 + x2 + 1) */
>>> +__u16 const crc16_table[256] = { me
>>> +	0x0000, 0xC0C1, 0xC181, 0x0140, 0xC301, 0x03C0, 0x0280, 0xC241,
>>>       
>> That's rather sad.  A plain old "depends on" would be better.
>>     
>
> My bad.  We wrote this patch and it had to run on older kernels that might
> not even have lib/crc16.c (it was added around 2.6.15 or so, so e.g. RHEL4
> doesn't have it).  I forgot to remove it from the upstream submission,
> and since it didn't cause problems nobody else complained...
The incremental patch below removes the local crc16 code, and also resolves an
issue with properly updating bg_itable_unused when an inode is allocated in an 
unused block groups.

Thanks
Avantika

---
 fs/Kconfig       |    1 +
 fs/ext4/ialloc.c |    8 +++++++-
 fs/ext4/super.c  |   51 +--------------------------------------------------
 3 files changed, 9 insertions(+), 51 deletions(-)

Index: linux-2.6.23-rc6/fs/ext4/ialloc.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/ialloc.c	2007-09-19 15:38:21.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/ialloc.c	2007-09-19 15:41:11.000000000 -0700
@@ -635,12 +635,18 @@
 	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
 		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
 			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
-			free = EXT4_INODES_PER_GROUP(sb);
+			free = 0;
 		} else {
 			free = EXT4_INODES_PER_GROUP(sb) -
 				le16_to_cpu(gdp->bg_itable_unused);
 		}
 
+		/*
+		 * Check the relative inode number against the last used
+		 * relative inode number in this group. if it is greater
+		 * we need to  update the bg_itable_unused count
+		 *
+		 */
 		if (ino > free)
 			gdp->bg_itable_unused =
 				cpu_to_le16(EXT4_INODES_PER_GROUP(sb) - ino);
Index: linux-2.6.23-rc6/fs/ext4/super.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/super.c	2007-09-19 15:38:21.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/super.c	2007-09-19 15:38:51.000000000 -0700
@@ -37,6 +37,7 @@
 #include <linux/quotaops.h>
 #include <linux/seq_file.h>
 #include <linux/log2.h>
+#include <linux/crc16.h>
 
 #include <asm/uaccess.h>
 
@@ -1248,56 +1249,6 @@
 	return res;
 }
 
-#if !defined(CONFIG_CRC16)
-/** CRC table for the CRC-16. The poly is 0x8005 (x16 + x15 + x2 + 1) */
-__u16 const crc16_table[256] = {
-	0x0000, 0xC0C1, 0xC181, 0x0140, 0xC301, 0x03C0, 0x0280, 0xC241,
-	0xC601, 0x06C0, 0x0780, 0xC741, 0x0500, 0xC5C1, 0xC481, 0x0440,
-	0xCC01, 0x0CC0, 0x0D80, 0xCD41, 0x0F00, 0xCFC1, 0xCE81, 0x0E40,
-	0x0A00, 0xCAC1, 0xCB81, 0x0B40, 0xC901, 0x09C0, 0x0880, 0xC841,
-	0xD801, 0x18C0, 0x1980, 0xD941, 0x1B00, 0xDBC1, 0xDA81, 0x1A40,
-	0x1E00, 0xDEC1, 0xDF81, 0x1F40, 0xDD01, 0x1DC0, 0x1C80, 0xDC41,
-	0x1400, 0xD4C1, 0xD581, 0x1540, 0xD701, 0x17C0, 0x1680, 0xD641,
-	0xD201, 0x12C0, 0x1380, 0xD341, 0x1100, 0xD1C1, 0xD081, 0x1040,
-	0xF001, 0x30C0, 0x3180, 0xF141, 0x3300, 0xF3C1, 0xF281, 0x3240,
-	0x3600, 0xF6C1, 0xF781, 0x3740, 0xF501, 0x35C0, 0x3480, 0xF441,
-	0x3C00, 0xFCC1, 0xFD81, 0x3D40, 0xFF01, 0x3FC0, 0x3E80, 0xFE41,
-	0xFA01, 0x3AC0, 0x3B80, 0xFB41, 0x3900, 0xF9C1, 0xF881, 0x3840,
-	0x2800, 0xE8C1, 0xE981, 0x2940, 0xEB01, 0x2BC0, 0x2A80, 0xEA41,
-	0xEE01, 0x2EC0, 0x2F80, 0xEF41, 0x2D00, 0xEDC1, 0xEC81, 0x2C40,
-	0xE401, 0x24C0, 0x2580, 0xE541, 0x2700, 0xE7C1, 0xE681, 0x2640,
-	0x2200, 0xE2C1, 0xE381, 0x2340, 0xE101, 0x21C0, 0x2080, 0xE041,
-	0xA001, 0x60C0, 0x6180, 0xA141, 0x6300, 0xA3C1, 0xA281, 0x6240,
-	0x6600, 0xA6C1, 0xA781, 0x6740, 0xA501, 0x65C0, 0x6480, 0xA441,
-	0x6C00, 0xACC1, 0xAD81, 0x6D40, 0xAF01, 0x6FC0, 0x6E80, 0xAE41,
-	0xAA01, 0x6AC0, 0x6B80, 0xAB41, 0x6900, 0xA9C1, 0xA881, 0x6840,
-	0x7800, 0xB8C1, 0xB981, 0x7940, 0xBB01, 0x7BC0, 0x7A80, 0xBA41,
-	0xBE01, 0x7EC0, 0x7F80, 0xBF41, 0x7D00, 0xBDC1, 0xBC81, 0x7C40,
-	0xB401, 0x74C0, 0x7580, 0xB541, 0x7700, 0xB7C1, 0xB681, 0x7640,
-	0x7200, 0xB2C1, 0xB381, 0x7340, 0xB101, 0x71C0, 0x7080, 0xB041,
-	0x5000, 0x90C1, 0x9181, 0x5140, 0x9301, 0x53C0, 0x5280, 0x9241,
-	0x9601, 0x56C0, 0x5780, 0x9741, 0x5500, 0x95C1, 0x9481, 0x5440,
-	0x9C01, 0x5CC0, 0x5D80, 0x9D41, 0x5F00, 0x9FC1, 0x9E81, 0x5E40,
-	0x5A00, 0x9AC1, 0x9B81, 0x5B40, 0x9901, 0x59C0, 0x5880, 0x9841,
-	0x8801, 0x48C0, 0x4980, 0x8941, 0x4B00, 0x8BC1, 0x8A81, 0x4A40,
-	0x4E00, 0x8EC1, 0x8F81, 0x4F40, 0x8D01, 0x4DC0, 0x4C80, 0x8C41,
-	0x4400, 0x84C1, 0x8581, 0x4540, 0x8701, 0x47C0, 0x4680, 0x8641,
-	0x8201, 0x42C0, 0x4380, 0x8341, 0x4100, 0x81C1, 0x8081, 0x4040
-};
-
-static inline __u16 crc16_byte(__u16 crc, const __u8 data)
-{
-	return (crc >> 8) ^ crc16_table[(crc ^ data) & 0xff];
-}
-
-__u16 crc16(__u16 crc, __u8 const *buffer, size_t len)
-{
-	while (len--)
-		crc = crc16_byte(crc, *buffer++);
-	return crc;
-}
-#endif
-
 __le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 block_group,
 			    struct ext4_group_desc *gdp)
 {
Index: linux-2.6.23-rc6/fs/Kconfig
===================================================================
--- linux-2.6.23-rc6.orig/fs/Kconfig	2007-09-19 15:38:21.000000000 -0700
+++ linux-2.6.23-rc6/fs/Kconfig	2007-09-19 15:38:51.000000000 -0700
@@ -140,6 +140,7 @@
 	tristate "Ext4dev/ext4 extended fs support development (EXPERIMENTAL)"
 	depends on EXPERIMENTAL
 	select JBD2
+	select CRC16
 	help
 	  Ext4dev is a predecessor filesystem of the next generation
 	  extended fs ext4, based on ext3 filesystem code. It will be

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Ext4: Uninitialized Block Groups
  2007-09-19  0:25 [PATCH] Ext4: Uninitialized Block Groups Avantika Mathur
                   ` (2 preceding siblings ...)
  2007-09-19 12:06 ` Valerie Clement
@ 2007-09-20 23:22 ` Andrew Morton
  3 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2007-09-20 23:22 UTC (permalink / raw)
  To: Avantika Mathur; +Cc: linux-kernel, linux-ext4, Andreas Dilger, cmm

On Tue, 18 Sep 2007 17:25:31 -0700
Avantika Mathur <mathur@linux.vnet.ibm.com> wrote:

> In pass1 of e2fsck, every inode table in the fileystem is scanned and checked, 
> regardless of whether it is in use.  This is this the most time consuming part 
> of the filesystem check.  The unintialized block group feature can greatly 
> reduce e2fsck time by eliminating checking of uninitialized inodes.  
> 
> With this feature, there is a a high water mark of used inodes for each block 
> group.  Block and inode bitmaps can be uninitialized on disk via a flag in the
> group descriptor to avoid reading or scanning them at e2fsck time.  A checksum
> of each group descriptor is used to ensure that corruption in the group
> descriptor's bit flags does not cause incorrect operation.

This needed a few fixups due to conflicts with
ext2-ext3-ext4-add-block-bitmap-validation.patch but they were pretty
straightforward.  Please check that the result is OK.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] Ext4: Uninitialized Block Groups
  2007-10-04  5:50               ` [PATCH] ext4: remove #ifdef CONFIG_EXT4_INDEX Theodore Ts'o
@ 2007-10-04  5:50                 ` Theodore Ts'o
  0 siblings, 0 replies; 12+ messages in thread
From: Theodore Ts'o @ 2007-10-04  5:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-ext4, Andreas Dilger, Avantika Mathur, Mingming Cao,
	Aneesh Kumar K.V

From: Andreas Dilger <adilger@clusterfs.com>

In pass1 of e2fsck, every inode table in the fileystem is scanned and checked,
regardless of whether it is in use.  This is this the most time consuming part
of the filesystem check.  The unintialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.

With this feature, there is a a high water mark of used inodes for each block
group.  Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time.  A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.

The feature is enabled through a mkfs option

	mke2fs /dev/ -O uninit_groups

A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.

The patches have been stress tested with fsstress and fsx.  In performance
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesytem.  In ext4 with the
uninitialized block groups feature, the e2fsck time is constant, based
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users.  With performance improvement of 2-20
times, depending on how full the filesystem is.

The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.

In each group descriptor if we have

EXT4_BG_INODE_UNINIT set in bg_flags:
        Inode table is not initialized/used in this group. So we can skip
        the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
        No block in the group is used. So we can skip the block bitmap
        verification for this group.

We also add two new fields to group descriptor as a part of
uninitialized group patch.

        __le16  bg_itable_unused;       /* Unused inodes count */
        __le16  bg_checksum;            /* crc16(sb_uuid+group+desc) */


bg_itable_unused:

If we have EXT4_BG_INODE_UNINIT not set in bg_flags
then bg_itable_unused will give the offset within
the inode table till the inodes are used. This can be
used by fsck to skip list of inodes that are marked unused.


bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure group descriptor
is not corrupt. We add checksum to group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.


Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/Kconfig              |    1 +
 fs/ext4/balloc.c        |   92 ++++++++++++++++++++++++++++-
 fs/ext4/group.h         |   29 +++++++++
 fs/ext4/ialloc.c        |  146 ++++++++++++++++++++++++++++++++++++++++++++---
 fs/ext4/resize.c        |    2 +
 fs/ext4/super.c         |   47 +++++++++++++++
 include/linux/ext4_fs.h |   16 ++++-
 7 files changed, 317 insertions(+), 16 deletions(-)
 create mode 100644 fs/ext4/group.h

diff --git a/fs/Kconfig b/fs/Kconfig
index f9eed6d..97eef97 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -140,6 +140,7 @@ config EXT4DEV_FS
 	tristate "Ext4dev/ext4 extended fs support development (EXPERIMENTAL)"
 	depends on EXPERIMENTAL
 	select JBD2
+	select CRC16
 	help
 	  Ext4dev is a predecessor filesystem of the next generation
 	  extended fs ext4, based on ext3 filesystem code. It will be
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index e53b4af..d1a8882 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -20,6 +20,7 @@
 #include <linux/quotaops.h>
 #include <linux/buffer_head.h>
 
+#include "group.h"
 /*
  * balloc.c contains the blocks allocation and deallocation routines
  */
@@ -42,6 +43,74 @@ void ext4_get_group_no_and_offset(struct super_block *sb, ext4_fsblk_t blocknr,
 
 }
 
+/* Initializes an uninitialized block bitmap if given, and returns the
+ * number of blocks free in the group. */
+unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
+				int block_group, struct ext4_group_desc *gdp)
+{
+	unsigned long start;
+	int bit, bit_max;
+	unsigned free_blocks;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	if (bh) {
+		J_ASSERT_BH(bh, buffer_locked(bh));
+
+		/* If checksum is bad mark all blocks used to prevent allocation
+		 * essentially implementing a per-group read-only flag. */
+		if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
+			ext4_error(sb, __FUNCTION__,
+				   "Checksum bad for group %u\n", block_group);
+			gdp->bg_free_blocks_count = 0;
+			gdp->bg_free_inodes_count = 0;
+			gdp->bg_itable_unused = 0;
+			memset(bh->b_data, 0xff, sb->s_blocksize);
+			return 0;
+		}
+		memset(bh->b_data, 0, sb->s_blocksize);
+	}
+
+	/* Check for superblock and gdt backups in this group */
+	bit_max = ext4_bg_has_super(sb, block_group);
+
+	if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_META_BG) ||
+	    block_group < le32_to_cpu(sbi->s_es->s_first_meta_bg) *
+			  sbi->s_desc_per_block) {
+		if (bit_max) {
+			bit_max += ext4_bg_num_gdb(sb, block_group);
+			bit_max +=
+				le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks);
+		}
+	} else { /* For META_BG_BLOCK_GROUPS */
+		int group_rel = (block_group -
+				 le32_to_cpu(sbi->s_es->s_first_meta_bg)) %
+				EXT4_DESC_PER_BLOCK(sb);
+		if (group_rel == 0 || group_rel == 1 ||
+		    (group_rel == EXT4_DESC_PER_BLOCK(sb) - 1))
+			bit_max += 1;
+	}
+
+	/* Last and first groups are always initialized */
+	free_blocks = EXT4_BLOCKS_PER_GROUP(sb) - bit_max;
+
+	if (bh) {
+		for (bit = 0; bit < bit_max; bit++)
+			ext4_set_bit(bit, bh->b_data);
+
+		start = block_group * EXT4_BLOCKS_PER_GROUP(sb) +
+			le32_to_cpu(sbi->s_es->s_first_data_block);
+
+		/* Set bits for block and inode bitmaps, and inode table */
+		ext4_set_bit(ext4_block_bitmap(sb, gdp) - start, bh->b_data);
+		ext4_set_bit(ext4_inode_bitmap(sb, gdp) - start, bh->b_data);
+		for (bit = le32_to_cpu(gdp->bg_inode_table) - start,
+		     bit_max = bit + sbi->s_itb_per_group; bit < bit_max; bit++)
+			ext4_set_bit(bit, bh->b_data);
+	}
+
+	return free_blocks - sbi->s_itb_per_group - 2;
+}
+
 /*
  * The free blocks are managed by bitmaps.  A file system contains several
  * blocks groups.  Each group contains 1 bitmap block for blocks, 1 bitmap
@@ -110,16 +179,29 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
  *
  * Return buffer_head on success or NULL in case of failure.
  */
-static struct buffer_head *
+struct buffer_head *
 read_block_bitmap(struct super_block *sb, unsigned int block_group)
 {
 	struct ext4_group_desc * desc;
 	struct buffer_head * bh = NULL;
 
-	desc = ext4_get_group_desc (sb, block_group, NULL);
+	desc = ext4_get_group_desc(sb, block_group, NULL);
 	if (!desc)
 		goto error_out;
-	bh = sb_bread(sb, ext4_block_bitmap(sb, desc));
+	if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+		bh = sb_getblk(sb, ext4_block_bitmap(sb, desc));
+		if (!buffer_uptodate(bh)) {
+			lock_buffer(bh);
+			if (!buffer_uptodate(bh)) {
+				ext4_init_block_bitmap(sb, bh, block_group,
+						       desc);
+				set_buffer_uptodate(bh);
+			}
+			unlock_buffer(bh);
+		}
+	} else {
+		bh = sb_bread(sb, ext4_block_bitmap(sb,desc));
+	}
 	if (!bh)
 		ext4_error (sb, "read_block_bitmap",
 			    "Cannot read block bitmap - "
@@ -586,6 +668,7 @@ do_more:
 	desc->bg_free_blocks_count =
 		cpu_to_le16(le16_to_cpu(desc->bg_free_blocks_count) +
 			group_freed);
+	desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
 	spin_unlock(sb_bgl_lock(sbi, block_group));
 	percpu_counter_mod(&sbi->s_freeblocks_counter, count);
 
@@ -1644,8 +1727,11 @@ allocated:
 			ret_block, goal_hits, goal_attempts);
 
 	spin_lock(sb_bgl_lock(sbi, group_no));
+	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
+		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, group_no, gdp);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
 	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
 
diff --git a/fs/ext4/group.h b/fs/ext4/group.h
new file mode 100644
index 0000000..9310979
--- /dev/null
+++ b/fs/ext4/group.h
@@ -0,0 +1,29 @@
+/*
+ *  linux/fs/ext4/group.h
+ *
+ * Copyright (C) 2007 Cluster File Systems, Inc
+ *
+ * Author: Andreas Dilger <adilger@clusterfs.com>
+ */
+
+#ifndef _LINUX_EXT4_GROUP_H
+#define _LINUX_EXT4_GROUP_H
+#if defined(CONFIG_CRC16)
+#include <linux/crc16.h>
+#endif
+
+extern __le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 group,
+				   struct ext4_group_desc *gdp);
+extern int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 group,
+				       struct ext4_group_desc *gdp);
+struct buffer_head *read_block_bitmap(struct super_block *sb,
+				      unsigned int block_group);
+extern unsigned ext4_init_block_bitmap(struct super_block *sb,
+				       struct buffer_head *bh, int group,
+				       struct ext4_group_desc *desc);
+#define ext4_free_blocks_after_init(sb, group, desc)			\
+		ext4_init_block_bitmap(sb, NULL, group, desc)
+extern unsigned ext4_init_inode_bitmap(struct super_block *sb,
+				       struct buffer_head *bh, int group,
+				       struct ext4_group_desc *desc);
+#endif /* _LINUX_EXT4_GROUP_H */
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index b8b538d..1fa418c 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -28,6 +28,7 @@
 
 #include "xattr.h"
 #include "acl.h"
+#include "group.h"
 
 /*
  * ialloc.c contains the inodes allocation and deallocation routines
@@ -43,6 +44,52 @@
  * the free blocks count in the block.
  */
 
+/*
+ * To avoid calling the atomic setbit hundreds or thousands of times, we only
+ * need to use it within a single byte (to ensure we get endianness right).
+ * We can use memset for the rest of the bitmap as there are no other users.
+ */
+static void mark_bitmap_end(int start_bit, int end_bit, char *bitmap)
+{
+	int i;
+
+	if (start_bit >= end_bit)
+		return;
+
+	ext4_debug("mark end bits +%d through +%d used\n", start_bit, end_bit);
+	for (i = start_bit; i < ((start_bit + 7) & ~7UL); i++)
+		ext4_set_bit(i, bitmap);
+	if (i < end_bit)
+		memset(bitmap + (i >> 3), 0xff, (end_bit - i) >> 3);
+}
+
+/* Initializes an uninitialized inode bitmap */
+unsigned ext4_init_inode_bitmap(struct super_block *sb,
+				struct buffer_head *bh, int block_group,
+				struct ext4_group_desc *gdp)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	J_ASSERT_BH(bh, buffer_locked(bh));
+
+	/* If checksum is bad mark all blocks and inodes use to prevent
+	 * allocation, essentially implementing a per-group read-only flag. */
+	if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
+		ext4_error(sb, __FUNCTION__, "Checksum bad for group %u\n",
+			   block_group);
+		gdp->bg_free_blocks_count = 0;
+		gdp->bg_free_inodes_count = 0;
+		gdp->bg_itable_unused = 0;
+		memset(bh->b_data, 0xff, sb->s_blocksize);
+		return 0;
+	}
+
+	memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
+	mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), EXT4_BLOCKS_PER_GROUP(sb),
+			bh->b_data);
+
+	return EXT4_INODES_PER_GROUP(sb);
+}
 
 /*
  * Read the inode allocation bitmap for a given block_group, reading
@@ -59,8 +106,20 @@ read_inode_bitmap(struct super_block * sb, unsigned long block_group)
 	desc = ext4_get_group_desc(sb, block_group, NULL);
 	if (!desc)
 		goto error_out;
-
-	bh = sb_bread(sb, ext4_inode_bitmap(sb, desc));
+	if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
+		bh = sb_getblk(sb, ext4_inode_bitmap(sb, desc));
+		if (!buffer_uptodate(bh)) {
+			lock_buffer(bh);
+			if (!buffer_uptodate(bh)) {
+				ext4_init_inode_bitmap(sb, bh, block_group,
+						       desc);
+				set_buffer_uptodate(bh);
+			}
+			unlock_buffer(bh);
+		}
+	} else {
+		bh = sb_bread(sb, ext4_inode_bitmap(sb, desc));
+	}
 	if (!bh)
 		ext4_error(sb, "read_inode_bitmap",
 			    "Cannot read inode bitmap - "
@@ -169,6 +228,8 @@ void ext4_free_inode (handle_t *handle, struct inode * inode)
 			if (is_directory)
 				gdp->bg_used_dirs_count = cpu_to_le16(
 				  le16_to_cpu(gdp->bg_used_dirs_count) - 1);
+			gdp->bg_checksum = ext4_group_desc_csum(sbi,
+							block_group, gdp);
 			spin_unlock(sb_bgl_lock(sbi, block_group));
 			percpu_counter_inc(&sbi->s_freeinodes_counter);
 			if (is_directory)
@@ -438,7 +499,7 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode * dir, int mode)
 	struct ext4_sb_info *sbi;
 	int err = 0;
 	struct inode *ret;
-	int i;
+	int i, free = 0;
 
 	/* Cannot create files in a deleted directory */
 	if (!dir || !dir->i_nlink)
@@ -520,11 +581,13 @@ repeat_in_this_group:
 	goto out;
 
 got:
-	ino += group * EXT4_INODES_PER_GROUP(sb) + 1;
-	if (ino < EXT4_FIRST_INO(sb) || ino > le32_to_cpu(es->s_inodes_count)) {
-		ext4_error (sb, "ext4_new_inode",
-			    "reserved inode or inode > inodes count - "
-			    "block_group = %d, inode=%lu", group, ino);
+	ino++;
+	if ((group == 0 && ino < EXT4_FIRST_INO(sb)) ||
+	    ino > EXT4_INODES_PER_GROUP(sb)) {
+		ext4_error(sb, __FUNCTION__,
+			   "reserved inode or inode > inodes count - "
+			   "block_group = %d, inode=%lu", group,
+			   ino + group * EXT4_INODES_PER_GROUP(sb));
 		err = -EIO;
 		goto fail;
 	}
@@ -532,13 +595,78 @@ got:
 	BUFFER_TRACE(bh2, "get_write_access");
 	err = ext4_journal_get_write_access(handle, bh2);
 	if (err) goto fail;
+
+	/* We may have to initialize the block bitmap if it isn't already */
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM) &&
+	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+		struct buffer_head *block_bh = read_block_bitmap(sb, group);
+
+		BUFFER_TRACE(block_bh, "get block bitmap access");
+		err = ext4_journal_get_write_access(handle, block_bh);
+		if (err) {
+			brelse(block_bh);
+			goto fail;
+		}
+
+		free = 0;
+		spin_lock(sb_bgl_lock(sbi, group));
+		/* recheck and clear flag under lock if we still need to */
+		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
+			free = ext4_free_blocks_after_init(sb, group, gdp);
+			gdp->bg_free_blocks_count = cpu_to_le16(free);
+		}
+		spin_unlock(sb_bgl_lock(sbi, group));
+
+		/* Don't need to dirty bitmap block if we didn't change it */
+		if (free) {
+			BUFFER_TRACE(block_bh, "dirty block bitmap");
+			err = ext4_journal_dirty_metadata(handle, block_bh);
+		}
+
+		brelse(block_bh);
+		if (err)
+			goto fail;
+	}
+
 	spin_lock(sb_bgl_lock(sbi, group));
+	/* If we didn't allocate from within the initialized part of the inode
+	 * table then we need to initialize up to this inode. */
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
+		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
+			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
+
+			/* When marking the block group with
+			 * ~EXT4_BG_INODE_UNINIT we don't want to depend
+			 * on the value of bg_itable_unsed even though
+			 * mke2fs could have initialized the same for us.
+			 * Instead we calculated the value below
+			 */
+
+			free = 0;
+		} else {
+			free = EXT4_INODES_PER_GROUP(sb) -
+				le16_to_cpu(gdp->bg_itable_unused);
+		}
+
+		/*
+		 * Check the relative inode number against the last used
+		 * relative inode number in this group. if it is greater
+		 * we need to  update the bg_itable_unused count
+		 *
+		 */
+		if (ino > free)
+			gdp->bg_itable_unused =
+				cpu_to_le16(EXT4_INODES_PER_GROUP(sb) - ino);
+	}
+
 	gdp->bg_free_inodes_count =
 		cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
 	if (S_ISDIR(mode)) {
 		gdp->bg_used_dirs_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1);
 	}
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
 	spin_unlock(sb_bgl_lock(sbi, group));
 	BUFFER_TRACE(bh2, "call ext4_journal_dirty_metadata");
 	err = ext4_journal_dirty_metadata(handle, bh2);
@@ -560,7 +688,7 @@ got:
 		inode->i_gid = current->fsgid;
 	inode->i_mode = mode;
 
-	inode->i_ino = ino;
+	inode->i_ino = ino + group * EXT4_INODES_PER_GROUP(sb);
 	/* This is the optimal IO size (for stat), not the fs block size */
 	inode->i_blocks = 0;
 	inode->i_mtime = inode->i_atime = inode->i_ctime = ei->i_crtime =
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index aa11d7d..3359450 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -16,6 +16,7 @@
 #include <linux/errno.h>
 #include <linux/slab.h>
 
+#include "group.h"
 
 #define outside(b, first, last)	((b) < (first) || (b) >= (last))
 #define inside(b, first, last)	((b) >= (first) && (b) < (last))
@@ -842,6 +843,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
 	ext4_inode_table_set(sb, gdp, input->inode_table); /* LV FIXME */
 	gdp->bg_free_blocks_count = cpu_to_le16(input->free_blocks_count);
 	gdp->bg_free_inodes_count = cpu_to_le16(EXT4_INODES_PER_GROUP(sb));
+	gdp->bg_checksum = ext4_group_desc_csum(sbi, input->group, gdp);
 
 	/*
 	 * Make the new blocks and inodes valid next.  We do this before
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 420d39d..b59610d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -37,12 +37,14 @@
 #include <linux/quotaops.h>
 #include <linux/seq_file.h>
 #include <linux/log2.h>
+#include <linux/crc16.h>
 
 #include <asm/uaccess.h>
 
 #include "xattr.h"
 #include "acl.h"
 #include "namei.h"
+#include "group.h"
 
 static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
 			     unsigned long journal_devnum);
@@ -1237,6 +1239,43 @@ static int ext4_setup_super(struct super_block *sb, struct ext4_super_block *es,
 	return res;
 }
 
+__le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 block_group,
+			    struct ext4_group_desc *gdp)
+{
+	__u16 crc = 0;
+
+	if (sbi->s_es->s_feature_ro_compat &
+	    cpu_to_le32(EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
+		int offset = offsetof(struct ext4_group_desc, bg_checksum);
+		__le32 le_group = cpu_to_le32(block_group);
+
+		crc = crc16(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
+		crc = crc16(crc, (__u8 *)&le_group, sizeof(le_group));
+		crc = crc16(crc, (__u8 *)gdp, offset);
+		offset += sizeof(gdp->bg_checksum); /* skip checksum */
+		/* for checksum of struct ext4_group_desc do the rest...*/
+		if ((sbi->s_es->s_feature_incompat &
+		     cpu_to_le32(EXT4_FEATURE_INCOMPAT_64BIT)) &&
+		    offset < le16_to_cpu(sbi->s_es->s_desc_size))
+			crc = crc16(crc, (__u8 *)gdp + offset,
+				    le16_to_cpu(sbi->s_es->s_desc_size) -
+					offset);
+	}
+
+	return cpu_to_le16(crc);
+}
+
+int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 block_group,
+				struct ext4_group_desc *gdp)
+{
+	if ((sbi->s_es->s_feature_ro_compat &
+	     cpu_to_le32(EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) &&
+	    (gdp->bg_checksum != ext4_group_desc_csum(sbi, block_group, gdp)))
+		return 0;
+
+	return 1;
+}
+
 /* Called at mount-time, super-block is locked */
 static int ext4_check_descriptors (struct super_block * sb)
 {
@@ -1291,6 +1330,14 @@ static int ext4_check_descriptors (struct super_block * sb)
 				    i, inode_table);
 			return 0;
 		}
+		if (!ext4_group_desc_csum_verify(sbi, i, gdp)) {
+			ext4_error(sb, __FUNCTION__,
+				   "Checksum for group %d failed (%u!=%u)\n", i,
+				   le16_to_cpu(ext4_group_desc_csum(sbi, i,
+								    gdp)),
+				   le16_to_cpu(gdp->bg_checksum));
+			return 0;
+		}
 		first_block += EXT4_BLOCKS_PER_GROUP(sb);
 		gdp = (struct ext4_group_desc *)
 			((__u8 *)gdp + EXT4_DESC_SIZE(sb));
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index 151738a..b77b59f 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -105,19 +105,25 @@
  */
 struct ext4_group_desc
 {
-	__le32	bg_block_bitmap;		/* Blocks bitmap block */
-	__le32	bg_inode_bitmap;		/* Inodes bitmap block */
+	__le32	bg_block_bitmap;	/* Blocks bitmap block */
+	__le32	bg_inode_bitmap;	/* Inodes bitmap block */
 	__le32	bg_inode_table;		/* Inodes table block */
 	__le16	bg_free_blocks_count;	/* Free blocks count */
 	__le16	bg_free_inodes_count;	/* Free inodes count */
 	__le16	bg_used_dirs_count;	/* Directories count */
-	__u16	bg_flags;
-	__u32	bg_reserved[3];
+	__le16	bg_flags;		/* EXT4_BG_flags (INODE_UNINIT, etc) */
+	__u32	bg_reserved[2];		/* Likely block/inode bitmap checksum */
+	__le16  bg_itable_unused;	/* Unused inodes count */
+	__le16  bg_checksum;		/* crc16(sb_uuid+group+desc) */
 	__le32	bg_block_bitmap_hi;	/* Blocks bitmap block MSB */
 	__le32	bg_inode_bitmap_hi;	/* Inodes bitmap block MSB */
 	__le32	bg_inode_table_hi;	/* Inodes table block MSB */
 };
 
+#define EXT4_BG_INODE_UNINIT	0x0001 /* Inode table/bitmap not in use */
+#define EXT4_BG_BLOCK_UNINIT	0x0002 /* Block bitmap not in use */
+#define EXT4_BG_INODE_ZEROED	0x0004 /* On-disk itable initialized to zero */
+
 #ifdef __KERNEL__
 #include <linux/ext4_fs_i.h>
 #include <linux/ext4_fs_sb.h>
@@ -665,6 +671,7 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 #define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
 #define EXT4_FEATURE_RO_COMPAT_LARGE_FILE	0x0002
 #define EXT4_FEATURE_RO_COMPAT_BTREE_DIR	0x0004
+#define EXT4_FEATURE_RO_COMPAT_GDT_CSUM		0x0010
 #define EXT4_FEATURE_RO_COMPAT_DIR_NLINK	0x0020
 #define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE	0x0040
 
@@ -684,6 +691,7 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 					 EXT4_FEATURE_INCOMPAT_64BIT)
 #define EXT4_FEATURE_RO_COMPAT_SUPP	(EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
 					 EXT4_FEATURE_RO_COMPAT_LARGE_FILE| \
+					 EXT4_FEATURE_RO_COMPAT_GDT_CSUM| \
 					 EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
 					 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
 					 EXT4_FEATURE_RO_COMPAT_BTREE_DIR)
-- 
1.5.3.2.81.g17ed

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2007-10-04  5:51 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-09-19  0:25 [PATCH] Ext4: Uninitialized Block Groups Avantika Mathur
2007-09-19  3:03 ` Andrew Morton
2007-09-19  6:30   ` Andreas Dilger
2007-09-19 22:54     ` Avantika Mathur
2007-09-19  3:05 ` Andrew Morton
2007-09-19 12:06 ` Valerie Clement
2007-09-19 11:34   ` Aneesh Kumar K.V
2007-09-19 12:01     ` Valerie Clement
2007-09-19 16:42   ` Andreas Dilger
2007-09-19 19:19     ` Avantika Mathur
2007-09-20 23:22 ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2007-10-04  5:50 [PATCH, RFC] Ext4 patches planned for submission upstream Theodore Ts'o
2007-10-04  5:50 ` [PATCH] jbd/jbd2: JBD memory allocation cleanups Theodore Ts'o
2007-10-04  5:50   ` [PATCH] jbd/jbd2: Journal initialization doesn't need __GFP_NOFAIL Theodore Ts'o
2007-10-04  5:50     ` [PATCH] JBD2/Ext4: Convert kmalloc to kzalloc in jbd2/ext4 Theodore Ts'o
     [not found]       ` <1191477059-5357-5-git-send-email-tytso@mit.edu>
2007-10-04  5:50         ` [PATCH] jbd2: fix commit code to properly abort journal Theodore Ts'o
2007-10-04  5:50           ` [PATCH] JBD2: debug code cleanup Theodore Ts'o
2007-10-04  5:50             ` [PATCH] Once ext4 will not implement fragment, it is believed it will never be Theodore Ts'o
2007-10-04  5:50               ` [PATCH] ext4: remove #ifdef CONFIG_EXT4_INDEX Theodore Ts'o
2007-10-04  5:50                 ` [PATCH] Ext4: Uninitialized Block Groups Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).