[PATCH 00/12] e2fsprogs mke2fs optimizations and new features

* [PATCH 00/12] e2fsprogs mke2fs optimizations and new features
@ 2014-01-20  5:54 Theodore Ts'o
  2014-01-20  5:54 ` [PATCH 01/12] libext2fs: fix off-by-one bug in ext2fs_extent_insert() Theodore Ts'o
                   ` (12 more replies)
  0 siblings, 13 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC
  To: Ext4 Developers List; +Cc: Theodore Ts'o

Here is the latest version of the patches I've been working on to
optimize the creation of very large file systems, and to allow a file
system to be created with its metadata blocks located at the beginning
of the device, so that the data block area can be made maximally
contiguous.

An example of the optimizations and new features added in this patch
set:

% cp /dev/null /tmp/foo.img
% mke2fs -F -t ext4 -O sparse_super2 -E num_backup_sb=0,packed_meta_blocks /tmp/foo.img 64T


Theodore Ts'o (12):
  libext2fs: fix off-by-one bug in ext2fs_extent_insert()
  libext2fs: clean up generic handling of
    ext2fs_find_first_{set,zero}_*()
  libext2fs: build tst_bitmaps with rep invariants checking enabled
  libext: optimize find_first_set() for bitarray-based bitmaps
  libext2fs: optimize find_first_{zero,set}() for red-black tree based
    bitmaps
  libext2fs: further clean up and rename check_block_uninit
  libext2fs: add ext2fs_block_alloc_stats_range()
  libext2fs: optimize ext2fs_allocate_group_table()
  libext2: optimize ext2fs_new_block2()
  mke2fs: optimize fix_cluster_bg_counts()
  Add support for new compat feature "sparse_super2"
  mke2fs: allow metadata blocks to be at the beginning of the file
    system

 debugfs/set_fields.c        |   2 +
 lib/e2p/feature.c           |   2 +
 lib/e2p/ls.c                |   8 ++
 lib/ext2fs/Makefile.in      |   7 +-
 lib/ext2fs/alloc.c          |  65 +++++-----------
 lib/ext2fs/alloc_stats.c    |  41 ++++++++++
 lib/ext2fs/alloc_tables.c   |  22 ++----
 lib/ext2fs/blkmap64_ba.c    |  83 ++++++++++++++++++--
 lib/ext2fs/blkmap64_rb.c    | 103 ++++++++++++++++++++++++-
 lib/ext2fs/closefs.c        |  12 ++-
 lib/ext2fs/ext2_fs.h        |   4 +-
 lib/ext2fs/ext2fs.h         |   6 +-
 lib/ext2fs/extent.c         |   4 +-
 lib/ext2fs/gen_bitmap64.c   |  66 ++++++++--------
 lib/ext2fs/initialize.c     |   2 +
 lib/ext2fs/mkjournal.c      |   5 +-
 lib/ext2fs/res_gdt.c        |  13 ++++
 lib/ext2fs/swapfs.c         |   2 +
 lib/ext2fs/tst_super_size.c |   3 +-
 misc/ext4.5.in              |  11 +++
 misc/mke2fs.8.in            |  20 +++++
 misc/mke2fs.c               | 136 ++++++++++++++++++++++++++++-----
 misc/mke2fs.conf.5.in       |   9 +++
 resize/online.c             |   8 ++
 resize/resize2fs.c          | 182 +++++++++++++++++++++++++++++++++++++++++++-
 25 files changed, 688 insertions(+), 128 deletions(-)

-- 
1.8.5.rc3.362.gdf10213



* [PATCH 01/12] libext2fs: fix off-by-one bug in ext2fs_extent_insert()
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-01-20  5:54 ` [PATCH 02/12] libext2fs: clean up generic handling of ext2fs_find_first_{set,zero}_*() Theodore Ts'o
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC
  To: Ext4 Developers List; +Cc: Theodore Ts'o

When inserting the first extent into an empty inode,
ext2fs_extent_insert() leaves path->left set to 1 instead of 0.  Since
path->curr is pointing at the last (only) extent in the file,
path->left should be 0.

This is mostly harmless, and gets corrected fairly quickly if the
calling application jumps to a different part of the extent tree,
for example by calling ext2fs_extent_goto(), or by calling
ext2fs_extent_get() with the flags argument set to EXT2_EXTENT_ROOT.
That is why we hadn't noticed this problem until now.

However, if you insert four extents using ext2fs_extent_insert, the
fourth insert will end up copying too many bytes in the i_block[]
array, since path->left is one larger than it should be.  This results
in the inode fields i_generation, i_file_acl, and i_size_high getting
zeroed out.
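
The overflow described above can be modelled with a small sketch.  This
is not libext2fs code: toy_inode, toy_insert() and the helpers are
hypothetical, and the sketch assumes 12-byte extent entries, a 60-byte
i_block[] area (header plus four entries) followed directly by 12 bytes
standing in for i_generation, i_file_acl, and i_size_high, and an
insert that shifts (left + 1) entries up by one slot.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_ENTRY_SIZE  12	/* extent header and entries: 12 bytes each */
#define TOY_IBLOCK_SIZE 60	/* i_block[]: header plus room for 4 entries */
#define TOY_TAIL_SIZE   12	/* stand-in for the three fields that follow */

/* Hypothetical flat model of the end of an inode (an assumption). */
struct toy_inode {
	unsigned char raw[TOY_IBLOCK_SIZE + TOY_TAIL_SIZE];
};

/* Store a value in the last tail field (the i_size_high stand-in). */
static void toy_set_size_high(struct toy_inode *ino, uint32_t v)
{
	memcpy(ino->raw + TOY_IBLOCK_SIZE + 8, &v, sizeof(v));
}

static uint32_t toy_size_high(const struct toy_inode *ino)
{
	uint32_t v;

	memcpy(&v, ino->raw + TOY_IBLOCK_SIZE + 8, sizeof(v));
	return v;
}

/*
 * Write a new entry into slot ix (0..3).  If left >= 0, first shift
 * (left + 1) entries starting at that slot up by one slot; left == -1
 * means there is nothing to shift.
 */
static void toy_insert(struct toy_inode *ino, int ix, int left)
{
	unsigned char *slot = ino->raw + TOY_ENTRY_SIZE * (1 + ix);

	assert(ix >= 0 && ix < 4);
	/* keep the model itself inside its own buffer */
	assert(left < 0 ||
	       (size_t)(ix + left + 3) * TOY_ENTRY_SIZE <= sizeof(ino->raw));
	if (left >= 0)
		memmove(slot + TOY_ENTRY_SIZE, slot,
			(size_t)(left + 1) * TOY_ENTRY_SIZE);
	memset(slot, 0xEE, TOY_ENTRY_SIZE);	/* the new entry */
}
```

In this model, a fourth insert with the correct value (left == -1)
leaves the tail untouched, while the off-by-one value (left == 0)
copies the still-empty fourth slot over the 12 bytes after i_block[].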

This problem can be replicated as follows:

% cp /dev/null /tmp/foo.img
% mke2fs -F -t ext4 /tmp/foo.img 100
% debugfs -w /tmp/foo.img
debugfs: write /dev/null foo
debugfs: set_inode_field foo i_size_hi 1
debugfs: stat foo
 <----- note that the inode's size is 4294967296
debugfs: extent_open foo
debugfs (extent ino 12): insert --after 0 1 100
debugfs (extent ino 12): insert --after 1 1 101
debugfs (extent ino 12): insert --after 2 1 102
debugfs (extent ino 12): insert --after 3 1 103
debugfs (extent ino 12): extent_close
debugfs: stat foo
 <----- note that the inode's size is now 0
debugfs: quit

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 lib/ext2fs/extent.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/ext2fs/extent.c b/lib/ext2fs/extent.c
index 5cdc2e4..6f4f1d2 100644
--- a/lib/ext2fs/extent.c
+++ b/lib/ext2fs/extent.c
@@ -1092,8 +1092,10 @@ errcode_t ext2fs_extent_insert(ext2_extent_handle_t handle, int flags,
 			ix++;
 			path->left--;
 		}
-	} else
+	} else {
 		ix = EXT_FIRST_INDEX(eh);
+		path->left = -1;
+	}

 	path->curr = ix;

-- 
1.8.5.rc3.362.gdf10213


* [PATCH 02/12] libext2fs: clean up generic handling of ext2fs_find_first_{set,zero}_*()
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
  2014-01-20  5:54 ` [PATCH 01/12] libext2fs: fix off-by-one bug in ext2fs_extent_insert() Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-01-20  5:54 ` [PATCH 03/12] libext2fs: build tst_bitmaps with rep invariants checking enabled Theodore Ts'o
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC
  To: Ext4 Developers List; +Cc: Theodore Ts'o

Move the error checking into the generic bitmap code, and add
support for bitmaps with cluster_bits set.
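
A minimal sketch of the cluster handling, under stated assumptions: the
backend here is a hypothetical 64-bit word with one bit per cluster
rather than a real ext2fs bitmap, and errors are collapsed to -1.  It
shows the block range being converted to a cluster range, and the
result being converted back and clamped so that it never falls before
the requested start block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical backend: bit c of "clusters" is set if cluster c is in use. */
static int toy_first_zero_cluster(uint64_t clusters, uint64_t cstart,
				  uint64_t cend, uint64_t *cout)
{
	uint64_t c;

	for (c = cstart; c <= cend && c < 64; c++) {
		if (!((clusters >> c) & 1)) {
			*cout = c;
			return 0;
		}
	}
	return -1;	/* nothing free in the range */
}

/* Find the first free block in [start, end], given per-cluster bits. */
static int toy_first_zero_block(uint64_t clusters, unsigned cluster_bits,
				uint64_t start, uint64_t end, uint64_t *out)
{
	uint64_t cstart = start >> cluster_bits;
	uint64_t cend = end >> cluster_bits;
	uint64_t cout;

	if (start > end)
		return -1;
	if (toy_first_zero_cluster(clusters, cstart, cend, &cout))
		return -1;
	cout <<= cluster_bits;		/* first block of the free cluster */
	*out = (cout >= start) ? cout : start;
	return 0;
}
```

The clamp matters when start lies in the middle of a free cluster: the
cluster's first block is below start, so start itself is returned.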

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 lib/ext2fs/blkmap64_ba.c  |  6 -----
 lib/ext2fs/gen_bitmap64.c | 66 ++++++++++++++++++++++++++---------------------
 2 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/lib/ext2fs/blkmap64_ba.c b/lib/ext2fs/blkmap64_ba.c
index 8eddde9..284236d 100644
--- a/lib/ext2fs/blkmap64_ba.c
+++ b/lib/ext2fs/blkmap64_ba.c
@@ -328,12 +328,6 @@ static errcode_t ba_find_first_zero(ext2fs_generic_bitmap bitmap,
 	const unsigned char *pos;
 	unsigned long max_loop_count, i;
 
-	if (start < bitmap->start || end > bitmap->end || start > end)
-		return EINVAL;
-
-	if (bitmap->cluster_bits)
-		return EINVAL;
-
 	/* scan bits until we hit a byte boundary */
 	while ((bitpos & 0x7) != 0 && count > 0) {
 		if (!ext2fs_test_bit64(bitpos, bp->bitarray)) {
diff --git a/lib/ext2fs/gen_bitmap64.c b/lib/ext2fs/gen_bitmap64.c
index fcf63ad..9615f1e 100644
--- a/lib/ext2fs/gen_bitmap64.c
+++ b/lib/ext2fs/gen_bitmap64.c
@@ -801,17 +801,14 @@ errcode_t ext2fs_find_first_zero_generic_bmap(ext2fs_generic_bitmap bitmap,
 					      __u64 start, __u64 end, __u64 *out)
 {
 	int b;
+	__u64 cstart, cend, cout;
+	errcode_t retval;
 
 	if (!bitmap)
 		return EINVAL;
 
-	if (EXT2FS_IS_64_BITMAP(bitmap) && bitmap->bitmap_ops->find_first_zero)
-		return bitmap->bitmap_ops->find_first_zero(bitmap, start,
-							   end, out);
-
 	if (EXT2FS_IS_32_BITMAP(bitmap)) {
 		blk_t blk = 0;
-		errcode_t retval;
 
 		if (((start) & ~0xffffffffULL) ||
 		    ((end) & ~0xffffffffULL)) {
@@ -829,23 +826,29 @@ errcode_t ext2fs_find_first_zero_generic_bmap(ext2fs_generic_bitmap bitmap,
 	if (!EXT2FS_IS_64_BITMAP(bitmap))
 		return EINVAL;
 
-	start >>= bitmap->cluster_bits;
-	end >>= bitmap->cluster_bits;
+	cstart = start >> bitmap->cluster_bits;
+	cend = end >> bitmap->cluster_bits;
 
-	if (start < bitmap->start || end > bitmap->end || start > end) {
+	if (cstart < bitmap->start || cend > bitmap->end || start > end) {
 		warn_bitmap(bitmap, EXT2FS_TEST_ERROR, start);
 		return EINVAL;
 	}
 
-	while (start <= end) {
-		b = bitmap->bitmap_ops->test_bmap(bitmap, start);
-		if (!b) {
-			*out = start << bitmap->cluster_bits;
-			return 0;
-		}
-		start++;
+	if (bitmap->bitmap_ops->find_first_zero) {
+		retval = bitmap->bitmap_ops->find_first_zero(bitmap, cstart,
+							     cend, &cout);
+		if (retval)
+			return retval;
+	found:
+		cout <<= bitmap->cluster_bits;
+		*out = (cout >= start) ? cout : start;
+		return 0;
 	}
 
+	for (cout = cstart; cout <= cend; cout++)
+		if (!bitmap->bitmap_ops->test_bmap(bitmap, cout))
+			goto found;
+
 	return ENOENT;
 }
 
@@ -853,17 +856,14 @@ errcode_t ext2fs_find_first_set_generic_bmap(ext2fs_generic_bitmap bitmap,
 					     __u64 start, __u64 end, __u64 *out)
 {
 	int b;
+	__u64 cstart, cend, cout;
+	errcode_t retval;
 
 	if (!bitmap)
 		return EINVAL;
 
-	if (EXT2FS_IS_64_BITMAP(bitmap) && bitmap->bitmap_ops->find_first_set)
-		return bitmap->bitmap_ops->find_first_set(bitmap, start,
-							  end, out);
-
 	if (EXT2FS_IS_32_BITMAP(bitmap)) {
 		blk_t blk = 0;
-		errcode_t retval;
 
 		if (((start) & ~0xffffffffULL) ||
 		    ((end) & ~0xffffffffULL)) {
@@ -881,22 +881,28 @@ errcode_t ext2fs_find_first_set_generic_bmap(ext2fs_generic_bitmap bitmap,
 	if (!EXT2FS_IS_64_BITMAP(bitmap))
 		return EINVAL;
 
-	start >>= bitmap->cluster_bits;
-	end >>= bitmap->cluster_bits;
+	cstart = start >> bitmap->cluster_bits;
+	cend = end >> bitmap->cluster_bits;
 
-	if (start < bitmap->start || end > bitmap->end || start > end) {
+	if (cstart < bitmap->start || cend > bitmap->end || start > end) {
 		warn_bitmap(bitmap, EXT2FS_TEST_ERROR, start);
 		return EINVAL;
 	}
 
-	while (start <= end) {
-		b = bitmap->bitmap_ops->test_bmap(bitmap, start);
-		if (b) {
-			*out = start << bitmap->cluster_bits;
-			return 0;
-		}
-		start++;
+	if (bitmap->bitmap_ops->find_first_set) {
+		retval = bitmap->bitmap_ops->find_first_set(bitmap, cstart,
+							    cend, &cout);
+		if (retval)
+			return retval;
+	found:
+		cout <<= bitmap->cluster_bits;
+		*out = (cout >= start) ? cout : start;
+		return 0;
 	}
 
+	for (cout = cstart; cout <= cend; cout++)
+		if (bitmap->bitmap_ops->test_bmap(bitmap, cout))
+			goto found;
+
 	return ENOENT;
 }
-- 
1.8.5.rc3.362.gdf10213



* [PATCH 03/12] libext2fs: build tst_bitmaps with rep invariants checking enabled
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
  2014-01-20  5:54 ` [PATCH 01/12] libext2fs: fix off-by-one bug in ext2fs_extent_insert() Theodore Ts'o
  2014-01-20  5:54 ` [PATCH 02/12] libext2fs: clean up generic handling of ext2fs_find_first_{set,zero}_*() Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-01-20  5:54 ` [PATCH 04/12] libext: optimize find_first_set() for bitarray-based bitmaps Theodore Ts'o
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC
  To: Ext4 Developers List; +Cc: Theodore Ts'o

When building tst_bitmaps, compile blkmap64_rb.c with DEBUG_RB
defined, so that a "make check" run always tests the sanity of the
in-memory representation of bitmaps that use red-black trees.
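
The general pattern, an invariant check that compiles to nothing unless
a debug macro is defined, can be sketched as follows.  Everything here
is illustrative: toy_extent, toy_extents_ok(), toy_check() and
TOY_DEBUG are hypothetical names, a sorted array stands in for the
red-black tree, and the invariants chosen (non-empty, sorted,
non-overlapping, non-adjacent extents) are an assumption about what
such a check might verify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_extent {
	uint64_t start;
	uint64_t count;
};

/*
 * Return 1 if every extent is non-empty and each one ends strictly
 * before the next begins (sorted, no overlap, no adjacency).
 */
static int toy_extents_ok(const struct toy_extent *e, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (e[i].count == 0)
			return 0;
		if (i + 1 < n && e[i].start + e[i].count >= e[i + 1].start)
			return 0;
	}
	return 1;
}

/* The check costs nothing unless the debug macro is defined. */
#ifdef TOY_DEBUG
#define toy_check(e, n) assert(toy_extents_ok((e), (n)))
#else
#define toy_check(e, n) do {} while (0)
#endif
```

Building only the test program with the macro defined keeps the checks
out of the production library while still exercising them on every
test run.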

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 lib/ext2fs/Makefile.in   |  7 ++++---
 lib/ext2fs/blkmap64_rb.c | 14 +++++++++++---
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in
index ef55e08..090b89a 100644
--- a/lib/ext2fs/Makefile.in
+++ b/lib/ext2fs/Makefile.in
@@ -372,10 +372,11 @@ tst_bitmaps_cmd.c: tst_bitmaps_cmd.ct
 	$(E) "	MK_CMDS $@"
 	$(Q) DIR=$(srcdir) $(MK_CMDS) $(srcdir)/tst_bitmaps_cmd.ct
 
-tst_bitmaps: tst_bitmaps.o tst_bitmaps_cmd.o $(STATIC_LIBEXT2FS) \
-		$(DEPSTATIC_LIBSS) $(DEPSTATIC_LIBCOM_ERR)
+tst_bitmaps: tst_bitmaps.o tst_bitmaps_cmd.o $(srcdir)/blkmap64_rb.c \
+		$(STATIC_LIBEXT2FS) $(DEPSTATIC_LIBSS) $(DEPSTATIC_LIBCOM_ERR)
 	$(E) "	LD $@"
-	$(Q) $(CC) -o $@ tst_bitmaps.o tst_bitmaps_cmd.o $(ALL_CFLAGS) \
+	$(Q) $(CC) -o $@ tst_bitmaps.o tst_bitmaps_cmd.o \
+		-DDEBUG_RB $(srcdir)/blkmap64_rb.c $(ALL_CFLAGS) \
 		$(STATIC_LIBEXT2FS) $(STATIC_LIBSS) $(STATIC_LIBCOM_ERR) \
 		$(SYSLIBS)
 
diff --git a/lib/ext2fs/blkmap64_rb.c b/lib/ext2fs/blkmap64_rb.c
index a189590..0e0a217 100644
--- a/lib/ext2fs/blkmap64_rb.c
+++ b/lib/ext2fs/blkmap64_rb.c
@@ -135,7 +135,7 @@ err_out:
 }
 #else
 #define check_tree(root, msg) do {} while (0)
-#define print_tree(root, msg) do {} while (0)
+#define print_tree(root) do {} while (0)
 #endif
 
 static void rb_get_new_extent(struct bmap_rb_extent **ext, __u64 start,
@@ -569,11 +569,14 @@ static int rb_remove_extent(__u64 start, __u64 count,
 static int rb_mark_bmap(ext2fs_generic_bitmap bitmap, __u64 arg)
 {
 	struct ext2fs_rb_private *bp;
+	int retval;
 
 	bp = (struct ext2fs_rb_private *) bitmap->private;
 	arg -= bitmap->start;
 
-	return rb_insert_extent(arg, 1, bp);
+	retval = rb_insert_extent(arg, 1, bp);
+	check_tree(&bp->root, __func__);
+	return retval;
 }
 
 static int rb_unmark_bmap(ext2fs_generic_bitmap bitmap, __u64 arg)
@@ -610,6 +613,7 @@ static void rb_mark_bmap_extent(ext2fs_generic_bitmap bitmap, __u64 arg,
 	arg -= bitmap->start;
 
 	rb_insert_extent(arg, num, bp);
+	check_tree(&bp->root, __func__);
 }
 
 static void rb_unmark_bmap_extent(ext2fs_generic_bitmap bitmap, __u64 arg,
@@ -714,11 +718,14 @@ static errcode_t rb_set_bmap_range(ext2fs_generic_bitmap bitmap,
 
 		rb_insert_extent(start + first_set - bitmap->start,
 				 i - first_set, bp);
+		check_tree(&bp->root, __func__);
 		first_set = -1;
 	}
-	if (first_set != -1)
+	if (first_set != -1) {
 		rb_insert_extent(start + first_set - bitmap->start,
 				 num - first_set, bp);
+		check_tree(&bp->root, __func__);
+	}
 
 	return 0;
 }
@@ -799,6 +806,7 @@ static void rb_clear_bmap(ext2fs_generic_bitmap bitmap)
 	bp->rcursor = NULL;
 	bp->rcursor_next = NULL;
 	bp->wcursor = NULL;
+	check_tree(&bp->root, __func__);
 }
 
 #ifdef BMAP_STATS
-- 
1.8.5.rc3.362.gdf10213



* [PATCH 04/12] libext: optimize find_first_set() for bitarray-based bitmaps
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
                   ` (2 preceding siblings ...)
  2014-01-20  5:54 ` [PATCH 03/12] libext2fs: build tst_bitmaps with rep invariants checking enabled Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-01-20  5:54 ` [PATCH 05/12] libext2fs: optimize find_first_{zero,set}() for red-black tree based bitmaps Theodore Ts'o
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC
  To: Ext4 Developers List; +Cc: Theodore Ts'o

This is a straightforward adaptation of the find_first_zero() function
for bitarray-based bitmaps.
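
The scanning strategy can be sketched in a simplified, self-contained
form.  This is not the libext2fs implementation: toy_test_bit() and
toy_find_first_set() are hypothetical, bit n is assumed to live in
byte n / 8 under mask 1 << (n % 8), words are loaded with memcpy()
rather than through an aligned pointer, and the range is assumed small
enough that the position cannot wrap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed bit layout: bit n is in byte n / 8, mask 1 << (n % 8). */
static int toy_test_bit(const unsigned char *map, uint64_t n)
{
	return (map[n >> 3] >> (n & 7)) & 1;
}

/* Find the first set bit in [start, end]; return 0 on success, -1 if none. */
static int toy_find_first_set(const unsigned char *map, uint64_t start,
			      uint64_t end, uint64_t *out)
{
	uint64_t pos = start;

	assert(start <= end);
	/* test single bits until we reach a byte boundary */
	while (pos <= end && (pos & 7)) {
		if (toy_test_bit(map, pos)) {
			*out = pos;
			return 0;
		}
		pos++;
	}
	/* skip 64 bits at a time while whole words are zero */
	while (pos + 63 <= end) {
		uint64_t w;

		memcpy(&w, map + (pos >> 3), sizeof(w));
		if (w != 0)
			break;
		pos += 64;
	}
	/* skip 8 bits at a time while whole bytes are zero */
	while (pos + 7 <= end && map[pos >> 3] == 0)
		pos += 8;
	/* here either a non-zero byte was found or fewer than 8 bits remain */
	for (; pos <= end; pos++) {
		if (toy_test_bit(map, pos)) {
			*out = pos;
			return 0;
		}
	}
	return -1;
}
```

The only difference from a find-first-zero scan is the comparison
value: zero bytes and words are skipped here, where the zero variant
skips bytes and words that are all ones.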

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 lib/ext2fs/blkmap64_ba.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/lib/ext2fs/blkmap64_ba.c b/lib/ext2fs/blkmap64_ba.c
index 284236d..894293a 100644
--- a/lib/ext2fs/blkmap64_ba.c
+++ b/lib/ext2fs/blkmap64_ba.c
@@ -391,6 +391,80 @@ static errcode_t ba_find_first_zero(ext2fs_generic_bitmap bitmap,
 	return ENOENT;
 }
 
+/* Find the first one bit between start and end, inclusive. */
+static errcode_t ba_find_first_set(ext2fs_generic_bitmap bitmap,
+				    __u64 start, __u64 end, __u64 *out)
+{
+	ext2fs_ba_private bp = (ext2fs_ba_private)bitmap->private;
+	unsigned long bitpos = start - bitmap->start;
+	unsigned long count = end - start + 1;
+	int byte_found = 0; /* whether a != 0xff byte has been found */
+	const unsigned char *pos;
+	unsigned long max_loop_count, i;
+
+	/* scan bits until we hit a byte boundary */
+	while ((bitpos & 0x7) != 0 && count > 0) {
+		if (ext2fs_test_bit64(bitpos, bp->bitarray)) {
+			*out = bitpos + bitmap->start;
+			return 0;
+		}
+		bitpos++;
+		count--;
+	}
+
+	if (!count)
+		return ENOENT;
+
+	pos = ((unsigned char *)bp->bitarray) + (bitpos >> 3);
+	/* scan bytes until 8-byte (64-bit) aligned */
+	while (count >= 8 && (((unsigned long)pos) & 0x07)) {
+		if (*pos != 0) {
+			byte_found = 1;
+			break;
+		}
+		pos++;
+		count -= 8;
+		bitpos += 8;
+	}
+
+	if (!byte_found) {
+		max_loop_count = count >> 6; /* 8-byte blocks */
+		i = max_loop_count;
+		while (i) {
+			if (*((const __u64 *)pos) != 0)
+				break;
+			pos += 8;
+			i--;
+		}
+		count -= 64 * (max_loop_count - i);
+		bitpos += 64 * (max_loop_count - i);
+
+		max_loop_count = count >> 3;
+		i = max_loop_count;
+		while (i) {
+			if (*pos != 0) {
+				byte_found = 1;
+				break;
+			}
+			pos++;
+			i--;
+		}
+		count -= 8 * (max_loop_count - i);
+		bitpos += 8 * (max_loop_count - i);
+	}
+
+	/* Here either count < 8 or byte_found == 1. */
+	while (count-- > 0) {
+		if (ext2fs_test_bit64(bitpos, bp->bitarray)) {
+			*out = bitpos + bitmap->start;
+			return 0;
+		}
+		bitpos++;
+	}
+
+	return ENOENT;
+}
+
 struct ext2_bitmap_ops ext2fs_blkmap64_bitarray = {
 	.type = EXT2FS_BMAP64_BITARRAY,
 	.new_bmap = ba_new_bmap,
@@ -407,5 +481,6 @@ struct ext2_bitmap_ops ext2fs_blkmap64_bitarray = {
 	.get_bmap_range = ba_get_bmap_range,
 	.clear_bmap = ba_clear_bmap,
 	.print_stats = ba_print_stats,
-	.find_first_zero = ba_find_first_zero
+	.find_first_zero = ba_find_first_zero,
+	.find_first_set = ba_find_first_set
 };
-- 
1.8.5.rc3.362.gdf10213



* [PATCH 05/12] libext2fs: optimize find_first_{zero,set}() for red-black tree based bitmaps
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
                   ` (3 preceding siblings ...)
  2014-01-20  5:54 ` [PATCH 04/12] libext: optimize find_first_set() for bitarray-based bitmaps Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-01-20  5:54 ` [PATCH 06/12] libext2fs: further clean up and rename check_block_uninit Theodore Ts'o
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC
  To: Ext4 Developers List; +Cc: Theodore Ts'o

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 lib/ext2fs/blkmap64_rb.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/lib/ext2fs/blkmap64_rb.c b/lib/ext2fs/blkmap64_rb.c
index 0e0a217..148f856 100644
--- a/lib/ext2fs/blkmap64_rb.c
+++ b/lib/ext2fs/blkmap64_rb.c
@@ -809,6 +809,93 @@ static void rb_clear_bmap(ext2fs_generic_bitmap bitmap)
 	check_tree(&bp->root, __func__);
 }
 
+static errcode_t rb_find_first_zero(ext2fs_generic_bitmap bitmap,
+				   __u64 start, __u64 end, __u64 *out)
+{
+	struct rb_node *parent = NULL, **n;
+	struct rb_node *node, *next;
+	struct ext2fs_rb_private *bp;
+	struct bmap_rb_extent *ext;
+	int retval = 1;
+
+	bp = (struct ext2fs_rb_private *) bitmap->private;
+	n = &bp->root.rb_node;
+	start -= bitmap->start;
+	end -= bitmap->start;
+
+	if (start > end)
+		return EINVAL;
+
+	if (EXT2FS_RB_EMPTY_ROOT(&bp->root))
+		return ENOENT;
+
+	while (*n) {
+		parent = *n;
+		ext = node_to_extent(parent);
+		if (start < ext->start) {
+			n = &(*n)->rb_left;
+		} else if (start >= (ext->start + ext->count)) {
+			n = &(*n)->rb_right;
+		} else if (ext->start + ext->count <= end) {
+			*out = ext->start + ext->count + bitmap->start;
+			return 0;
+		} else
+			return ENOENT;
+	}
+
+	*out = start + bitmap->start;
+	return 0;
+}
+
+static errcode_t rb_find_first_set(ext2fs_generic_bitmap bitmap,
+				   __u64 start, __u64 end, __u64 *out)
+{
+	struct rb_node *parent = NULL, **n;
+	struct rb_node *node, *next;
+	struct ext2fs_rb_private *bp;
+	struct bmap_rb_extent *ext;
+	int retval = 1;
+
+	bp = (struct ext2fs_rb_private *) bitmap->private;
+	n = &bp->root.rb_node;
+	start -= bitmap->start;
+	end -= bitmap->start;
+
+	if (start > end)
+		return EINVAL;
+
+	if (EXT2FS_RB_EMPTY_ROOT(&bp->root))
+		return ENOENT;
+
+	while (*n) {
+		parent = *n;
+		ext = node_to_extent(parent);
+		if (start < ext->start) {
+			n = &(*n)->rb_left;
+		} else if (start >= (ext->start + ext->count)) {
+			n = &(*n)->rb_right;
+		} else {
+			/* The start bit is set */
+			*out = start + bitmap->start;
+			return 0;
+		}
+	}
+
+	node = parent;
+	ext = node_to_extent(node);
+	if (ext->start < start) {
+		node = ext2fs_rb_next(node);
+		if (node == NULL)
+			return ENOENT;
+		ext = node_to_extent(node);
+	}
+	if (ext->start <= end) {
+		*out = ext->start + bitmap->start;
+		return 0;
+	}
+	return ENOENT;
+}
+
 #ifdef BMAP_STATS
 static void rb_print_stats(ext2fs_generic_bitmap bitmap)
 {
@@ -890,4 +977,6 @@ struct ext2_bitmap_ops ext2fs_blkmap64_rbtree = {
 	.get_bmap_range = rb_get_bmap_range,
 	.clear_bmap = rb_clear_bmap,
 	.print_stats = rb_print_stats,
+	.find_first_zero = rb_find_first_zero,
+	.find_first_set = rb_find_first_set,
 };
-- 
1.8.5.rc3.362.gdf10213



* [PATCH 06/12] libext2fs: further clean up and rename check_block_uninit
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
                   ` (4 preceding siblings ...)
  2014-01-20  5:54 ` [PATCH 05/12] libext2fs: optimize find_first_{zero,set}() for red-black tree based bitmaps Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-01-20 20:17   ` Darrick J. Wong
  2014-01-20  5:54 ` [PATCH 07/12] libext2fs: add ext2fs_block_alloc_stats_range() Theodore Ts'o
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC
  To: Ext4 Developers List; +Cc: Theodore Ts'o, Darrick J. Wong

Commit 8e44eb64bb (libext2fs: mark group data blocks when loading
block bitmap) simplified check_block_uninit() since we are now
initializing the bitmap when it is loaded from disk.  It left some
variables which were being set but never used, however.  In addition,
since we only need check_block_uninit() to clear the block bitmap's
uninit flag, rename it to clear_block_uninit(), and only call it once
we have found a free block in ext2fs_new_block2().

This cleans up the code somewhat and speeds things up if we need to
search multiple block groups to find a free block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
---
 lib/ext2fs/alloc.c | 33 +++++----------------------------
 1 file changed, 5 insertions(+), 28 deletions(-)

diff --git a/lib/ext2fs/alloc.c b/lib/ext2fs/alloc.c
index 42096a4..91637dd 100644
--- a/lib/ext2fs/alloc.c
+++ b/lib/ext2fs/alloc.c
@@ -27,31 +27,15 @@
 #include "ext2fs.h"
 
 /*
- * Check for uninit block bitmaps and deal with them appropriately
+ * Clear the uninit block bitmap flag if necessary
  */
-static void check_block_uninit(ext2_filsys fs, ext2fs_block_bitmap map,
-			       dgrp_t group)
+static void clear_block_uninit(ext2_filsys fs, dgrp_t group)
 {
-	blk_t		i;
-	blk64_t		blk, super_blk, old_desc_blk, new_desc_blk;
-	int		old_desc_blocks;
-
 	if (!(EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
 					 EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) ||
 	    !(ext2fs_bg_flags_test(fs, group, EXT2_BG_BLOCK_UNINIT)))
 		return;
 
-	blk = ext2fs_group_first_block2(fs, group);
-
-	ext2fs_super_and_bgd_loc2(fs, group, &super_blk,
-				  &old_desc_blk, &new_desc_blk, 0);
-
-	if (fs->super->s_feature_incompat &
-	    EXT2_FEATURE_INCOMPAT_META_BG)
-		old_desc_blocks = fs->super->s_first_meta_bg;
-	else
-		old_desc_blocks = fs->desc_blocks + fs->super->s_reserved_gdt_blocks;
-
 	/* uninit block bitmaps are now initialized in read_bitmaps() */
 
 	ext2fs_bg_flags_clear(fs, group, EXT2_BG_BLOCK_UNINIT);
@@ -78,10 +62,11 @@ static void check_inode_uninit(ext2_filsys fs, ext2fs_inode_bitmap map,
 		ext2fs_fast_unmark_inode_bitmap2(map, ino);
 
 	ext2fs_bg_flags_clear(fs, group, EXT2_BG_INODE_UNINIT);
+	/* Mimics what the kernel does */
+	ext2fs_bg_flags_clear(fs, group, EXT2_BG_BLOCK_UNINIT);
 	ext2fs_group_desc_csum_set(fs, group);
 	ext2fs_mark_ib_dirty(fs);
 	ext2fs_mark_super_dirty(fs);
-	check_block_uninit(fs, fs->block_map, group);
 }
 
 /*
@@ -167,17 +152,9 @@ errcode_t ext2fs_new_block2(ext2_filsys fs, blk64_t goal,
 	c_ratio = 1 << ext2fs_get_bitmap_granularity(map);
 	if (c_ratio > 1)
 		goal &= ~EXT2FS_CLUSTER_MASK(fs);
-	check_block_uninit(fs, map,
-			   (i - fs->super->s_first_data_block) /
-			   EXT2_BLOCKS_PER_GROUP(fs->super));
 	do {
-		if (((i - fs->super->s_first_data_block) %
-		     EXT2_BLOCKS_PER_GROUP(fs->super)) == 0)
-			check_block_uninit(fs, map,
-					   (i - fs->super->s_first_data_block) /
-					   EXT2_BLOCKS_PER_GROUP(fs->super));


* [PATCH 07/12] libext2fs: add ext2fs_block_alloc_stats_range()
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
                   ` (5 preceding siblings ...)
  2014-01-20  5:54 ` [PATCH 06/12] libext2fs: further clean up and rename check_block_uninit Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-02-13 21:50   ` Darrick J. Wong
  2014-01-20  5:54 ` [PATCH 08/12] libext2fs: optimize ext2fs_allocate_group_table() Theodore Ts'o
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC
  To: Ext4 Developers List; +Cc: Theodore Ts'o

This function is more efficient than calling ext2fs_block_alloc_stats2()
for each block in a range.  The efficiencies come from being able to
set a block range in the block bitmap at once, and from being able to
update the block group descriptors once per block group.  Especially
now that we are checksumming the block group descriptors, and we are
using red-black trees for the allocation bitmaps, these changes can
make a huge difference in the CPU time used by mke2fs when creating
very large file systems.
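
The once-per-group accounting can be illustrated with a toy model.  The
names (toy_fs, toy_alloc_stats_range()) and the tiny geometry of 4
groups of 8 blocks are assumptions made for the sketch; the
desc_updates counter stands in for the descriptor checksum
recomputation, and the bitmap update is left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLOCKS_PER_GROUP 8
#define TOY_GROUPS           4

/* Hypothetical miniature of the counters a file system keeps. */
struct toy_fs {
	uint32_t group_free[TOY_GROUPS];	/* per-group free counts */
	uint64_t total_free;			/* file-system-wide total */
	unsigned desc_updates;			/* descriptor rewrites */
};

/*
 * Account for "num" blocks starting at "blk" being allocated
 * (inuse > 0) or freed (inuse < 0), touching each group once.
 */
static void toy_alloc_stats_range(struct toy_fs *fs, uint64_t blk,
				  uint32_t num, int inuse)
{
	assert(blk + num <= (uint64_t)TOY_GROUPS * TOY_BLOCKS_PER_GROUP);
	if (inuse == 0)
		return;
	while (num) {
		uint64_t group = blk / TOY_BLOCKS_PER_GROUP;
		uint64_t last = (group + 1) * TOY_BLOCKS_PER_GROUP - 1;
		uint32_t n = num;

		/* clip the chunk at the end of the current group */
		if (blk + num - 1 > last)
			n = (uint32_t)(last - blk + 1);
		if (inuse > 0) {
			fs->group_free[group] -= n;
			fs->total_free -= n;
		} else {
			fs->group_free[group] += n;
			fs->total_free += n;
		}
		fs->desc_updates++;	/* one update per group, not per block */
		blk += n;
		num -= n;
	}
}
```

Marking blocks 6 through 17 in use in this model touches three groups
and performs three descriptor updates, where a per-block loop would
perform twelve.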

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 lib/ext2fs/alloc_stats.c | 41 +++++++++++++++++++++++++++++++++++++++++
 lib/ext2fs/ext2fs.h      |  2 ++
 2 files changed, 43 insertions(+)

diff --git a/lib/ext2fs/alloc_stats.c b/lib/ext2fs/alloc_stats.c
index adec363..7919c09 100644
--- a/lib/ext2fs/alloc_stats.c
+++ b/lib/ext2fs/alloc_stats.c
@@ -106,3 +106,44 @@ void ext2fs_set_block_alloc_stats_callback(ext2_filsys fs,
 
 	fs->block_alloc_stats = func;
 }
+
+void ext2fs_block_alloc_stats_range(ext2_filsys fs, blk64_t blk,
+				    blk_t num, int inuse)
+{
+#ifndef OMIT_COM_ERR
+	if (blk + num >= ext2fs_blocks_count(fs->super)) {
+		com_err("ext2fs_block_alloc_stats_range", 0,
+			"Illegal block range: %llu (%u) ",
+			(unsigned long long) blk, num);
+		return;
+	}
+#endif
+	if (inuse == 0)
+		return;
+	if (inuse > 0) {
+		ext2fs_mark_block_bitmap_range2(fs->block_map, blk, num);
+		inuse = 1;
+	} else {
+		ext2fs_unmark_block_bitmap_range2(fs->block_map, blk, num);
+		inuse = -1;
+	}
+	while (num) {
+		int group = ext2fs_group_of_blk2(fs, blk);
+		blk64_t last_blk = ext2fs_group_last_block2(fs, group);
+		blk_t n = num;
+
+		if (blk + num > last_blk)
+			n = last_blk - blk + 1;
+
+		ext2fs_bg_free_blocks_count_set(fs, group,
+			ext2fs_bg_free_blocks_count(fs, group) -
+			inuse*n/EXT2FS_CLUSTER_RATIO(fs));
+		ext2fs_bg_flags_clear(fs, group, EXT2_BG_BLOCK_UNINIT);
+		ext2fs_group_desc_csum_set(fs, group);
+		ext2fs_free_blocks_count_add(fs->super, -inuse * n);
+		blk += n;
+		num -= n;
+	}
+	ext2fs_mark_super_dirty(fs);
+	ext2fs_mark_bb_dirty(fs);
+}
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 1e07f88..47340dd 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -683,6 +683,8 @@ void ext2fs_inode_alloc_stats2(ext2_filsys fs, ext2_ino_t ino,
 			       int inuse, int isdir);
 void ext2fs_block_alloc_stats(ext2_filsys fs, blk_t blk, int inuse);
 void ext2fs_block_alloc_stats2(ext2_filsys fs, blk64_t blk, int inuse);
+void ext2fs_block_alloc_stats_range(ext2_filsys fs, blk64_t blk,
+				    blk_t num, int inuse);
 
 /* alloc_tables.c */
 extern errcode_t ext2fs_allocate_tables(ext2_filsys fs);
-- 
1.8.5.rc3.362.gdf10213



* [PATCH 08/12] libext2fs: optimize ext2fs_allocate_group_table()
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
                   ` (6 preceding siblings ...)
  2014-01-20  5:54 ` [PATCH 07/12] libext2fs: add ext2fs_block_alloc_stats_range() Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-01-20  5:54 ` [PATCH 09/12] libext2: optimize ext2fs_new_block2() Theodore Ts'o
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC
  To: Ext4 Developers List; +Cc: Theodore Ts'o

By using ext2fs_mark_block_bitmap_range2() and/or
ext2fs_block_alloc_stats_range(), we can significantly reduce the
time needed by mke2fs to allocate the inode table.

For example, the CPU time needed to run the command "mke2fs -t ext4
/tmp/foo.img 32T" (where tmpfs was mounted on /tmp) decreased from
21.7 seconds to under 1.7 seconds.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 lib/ext2fs/alloc_tables.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/lib/ext2fs/alloc_tables.c b/lib/ext2fs/alloc_tables.c
index 9f3d4e0..fec9003 100644
--- a/lib/ext2fs/alloc_tables.c
+++ b/lib/ext2fs/alloc_tables.c
@@ -83,9 +83,8 @@ static blk64_t flexbg_offset(ext2_filsys fs, dgrp_t group, blk64_t start_blk,
 errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
 				      ext2fs_block_bitmap bmap)
 {
-	unsigned int	j;
 	errcode_t	retval;
-	blk64_t		group_blk, start_blk, last_blk, new_blk, blk;
+	blk64_t		group_blk, start_blk, last_blk, new_blk;
 	dgrp_t		last_grp = 0;
 	int		rem_grps = 0, flexbg_size = 0;
 
@@ -205,19 +204,12 @@ errcode_t ext2fs_allocate_group_table(ext2_filsys fs, dgrp_t group,
 						bmap, &new_blk);
 		if (retval)
 			return retval;
-		for (j=0, blk = new_blk;
-		     j < fs->inode_blocks_per_group;
-		     j++, blk++) {
-			ext2fs_mark_block_bitmap2(bmap, blk);
-			if (flexbg_size) {
-				dgrp_t gr = ext2fs_group_of_blk2(fs, blk);
-				ext2fs_bg_free_blocks_count_set(fs, gr, ext2fs_bg_free_blocks_count(fs, gr) - 1);
-				ext2fs_free_blocks_count_add(fs->super, -1);
-				ext2fs_bg_flags_clear(fs, gr,
-						     EXT2_BG_BLOCK_UNINIT);
-				ext2fs_group_desc_csum_set(fs, gr);
-			}
-		}
+		if (flexbg_size)
+			ext2fs_block_alloc_stats_range(fs, new_blk,
+				       fs->inode_blocks_per_group, +1);
+		else
+			ext2fs_mark_block_bitmap_range2(fs->block_map,
+					new_blk, fs->inode_blocks_per_group);
 		ext2fs_inode_table_loc_set(fs, group, new_blk);
 	}
 	ext2fs_group_desc_csum_set(fs, group);
-- 
1.8.5.rc3.362.gdf10213



* [PATCH 09/12] libext2: optimize ext2fs_new_block2()
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
                   ` (7 preceding siblings ...)
  2014-01-20  5:54 ` [PATCH 08/12] libext2fs: optimize ext2fs_allocate_group_table() Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-01-20 21:52   ` Andreas Dilger
  2014-01-20  5:54 ` [PATCH 10/12] mke2fs: optimize fix_cluster_bg_counts() Theodore Ts'o
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC
  To: Ext4 Developers List; +Cc: Theodore Ts'o

If there are hundreds of thousands of blocks which are in use before
the first free block, it is much, MUCH faster to use
ext2fs_find_first_zero_block_bitmap2() instead of searching the
allocation bitmap bit by bit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
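The speedup described above can be illustrated with a minimal,
self-contained sketch.  It is not the libext2fs implementation: the
flat uint64_t bitmap and the names sketch_find_first_zero() and
sketch_new_block() are assumptions made for illustration only.  It
shows how a find-first-zero primitive can skip fully allocated words
in one comparison, and how the allocator searches from the goal to the
end and then wraps around to the blocks before the goal.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flat bitmap: bit i set means block i is in use. */
static int sketch_test_bit(const uint64_t *map, uint64_t i)
{
	return (int)((map[i / 64] >> (i % 64)) & 1);
}

/*
 * Find the first zero bit in [start, end] (inclusive).  A 64-bit
 * word that is entirely in use is skipped with a single comparison.
 * Returns 0 and stores the bit in *out, or ENOENT if none is free.
 */
static int sketch_find_first_zero(const uint64_t *map, uint64_t start,
				  uint64_t end, uint64_t *out)
{
	uint64_t i = start;

	assert(start <= end);
	while (i <= end) {
		if (i % 64 == 0 && end - i >= 63 &&
		    map[i / 64] == UINT64_MAX) {
			i += 64;	/* whole word in use: skip it */
			continue;
		}
		if (!sketch_test_bit(map, i)) {
			*out = i;
			return 0;
		}
		i++;
	}
	return ENOENT;
}

/* Search goal..last first, then wrap around to first..goal-1. */
static int sketch_new_block(const uint64_t *map, uint64_t first,
			    uint64_t last, uint64_t goal, uint64_t *ret)
{
	int err;

	if (goal < first || goal > last)
		goal = first;
	err = sketch_find_first_zero(map, goal, last, ret);
	if (err == ENOENT && goal != first)
		err = sketch_find_first_zero(map, first, goal - 1, ret);
	return err;
}
```

With hundreds of thousands of in-use blocks ahead of the first free
one, nearly all of the work in this sketch is done by the word
comparison rather than by per-bit tests.
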
 lib/ext2fs/alloc.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/lib/ext2fs/alloc.c b/lib/ext2fs/alloc.c
index 91637dd..7dcfd3c 100644
--- a/lib/ext2fs/alloc.c
+++ b/lib/ext2fs/alloc.c
@@ -137,8 +137,8 @@ errcode_t ext2fs_new_inode(ext2_filsys fs, ext2_ino_t dir,
 errcode_t ext2fs_new_block2(ext2_filsys fs, blk64_t goal,
 			   ext2fs_block_bitmap map, blk64_t *ret)
 {
-	blk64_t	i;
-	int	c_ratio;
+	errcode_t retval;
+	blk64_t	b;
 
 	EXT2_CHECK_MAGIC(fs, EXT2_ET_MAGIC_EXT2FS_FILSYS);
 
@@ -148,21 +148,21 @@ errcode_t ext2fs_new_block2(ext2_filsys fs, blk64_t goal,
 		return EXT2_ET_NO_BLOCK_BITMAP;
 	if (!goal || (goal >= ext2fs_blocks_count(fs->super)))
 		goal = fs->super->s_first_data_block;
-	i = goal;
-	c_ratio = 1 << ext2fs_get_bitmap_granularity(map);
-	if (c_ratio > 1)
-		goal &= ~EXT2FS_CLUSTER_MASK(fs);
-	do {
-		if (!ext2fs_fast_test_block_bitmap2(map, i)) {
-			clear_block_uninit(fs, ext2fs_group_of_blk2(fs, i));
-			*ret = i;
-			return 0;
-		}
-		i = (i + c_ratio) & ~(c_ratio - 1);
-		if (i >= ext2fs_blocks_count(fs->super))
-			i = fs->super->s_first_data_block;
-	} while (i != goal);
-	return EXT2_ET_BLOCK_ALLOC_FAIL;
+	goal &= ~EXT2FS_CLUSTER_MASK(fs);
+
+	retval = ext2fs_find_first_zero_block_bitmap2(map,
+			goal, ext2fs_blocks_count(fs->super) - 1, &b);
+	if ((retval == ENOENT) && (goal != fs->super->s_first_data_block))
+		retval = ext2fs_find_first_zero_block_bitmap2(map,
+			fs->super->s_first_data_block, goal - 1, &b);
+	if (retval == ENOENT)
+		return EXT2_ET_BLOCK_ALLOC_FAIL;
+	if (retval)
+		return retval;
+
+	clear_block_uninit(fs, ext2fs_group_of_blk2(fs, b));
+	*ret = b;
+	return 0;
 }
 
 errcode_t ext2fs_new_block(ext2_filsys fs, blk_t goal,
-- 
1.8.5.rc3.362.gdf10213



* [PATCH 10/12] mke2fs: optimize fix_cluster_bg_counts()
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
                   ` (8 preceding siblings ...)
  2014-01-20  5:54 ` [PATCH 09/12] libext2: optimize ext2fs_new_block2() Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-01-20  5:54 ` [PATCH 11/12] Add support for new compat feature "sparse_super2" Theodore Ts'o
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Theodore Ts'o

Instead of testing the allocation bitmap bit by bit with
ext2fs_test_block_bitmap2(), use ext2fs_find_first_zero_block_bitmap2()
and ext2fs_find_first_set_block_bitmap2() to locate each run of free
clusters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
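The run-based accounting can be sketched in isolation as follows.
This is only an illustration under stated assumptions: the byte-array
bitmap and the names next_bit() and count_free_by_runs() are
hypothetical, and next_bit() is a simple stand-in for the library's
find-first-zero and find-first-set primitives, which are what make the
real code fast.  The point of the sketch is the structure: find the
start of a free run, find its end, and add the run length to the
per-group and total counters.

```c
#include <assert.h>
#include <stdint.h>

/* First index in [start, end] whose bit equals `want`, else end + 1. */
static uint64_t next_bit(const uint8_t *map, uint64_t start, uint64_t end,
			 int want)
{
	uint64_t i;

	for (i = start; i <= end; i++)
		if (((map[i / 8] >> (i % 8)) & 1) == want)
			break;
	return i;
}

/*
 * Count free (zero) bits per group by walking runs of free bits
 * rather than incrementing a counter once per bit.  Returns the
 * total number of free bits; grp_free[] receives per-group counts.
 */
static uint64_t count_free_by_runs(const uint8_t *map, uint64_t nbits,
				   uint64_t per_group, uint32_t *grp_free)
{
	uint64_t total = 0, start, g = 0;

	assert(per_group > 0);
	for (start = 0; start < nbits; start += per_group, g++) {
		uint64_t last = start + per_group - 1;
		uint64_t cur = start;

		if (last >= nbits)
			last = nbits - 1;	/* short final group */
		grp_free[g] = 0;
		while (cur <= last) {
			uint64_t run_start = next_bit(map, cur, last, 0);
			uint64_t run_end;

			if (run_start > last)
				break;		/* no more free bits */
			run_end = next_bit(map, run_start, last, 1);
			grp_free[g] += (uint32_t)(run_end - run_start);
			cur = run_end;
		}
		total += grp_free[g];
	}
	return total;
}
```
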
 misc/mke2fs.c | 41 +++++++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 14 deletions(-)

diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index c45b42f..7daa87e 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -2332,27 +2332,40 @@ static int mke2fs_discard_device(ext2_filsys fs)
 
 static void fix_cluster_bg_counts(ext2_filsys fs)
 {
-	blk64_t	cluster, num_clusters, tot_free;
-	unsigned num = 0;
-	int	grp_free, num_free, group;
+	blk64_t		cluster, num_clusters, tot_free, last_cluster, next;
+	errcode_t	retval;
+	dgrp_t		group = 0;
+	int		grp_free = 0, num_free = 0;
 
 	num_clusters = EXT2FS_B2C(fs, ext2fs_blocks_count(fs->super));
-	tot_free = num_free = group = grp_free = 0;
-	for (cluster = EXT2FS_B2C(fs, fs->super->s_first_data_block);
-	     cluster < num_clusters; cluster++) {
-		if (!ext2fs_test_block_bitmap2(fs->block_map,
-					       EXT2FS_C2B(fs, cluster))) {
-			grp_free++;
-			tot_free++;
+	last_cluster = EXT2FS_B2C(fs, ext2fs_group_last_block2(fs, group));
+	cluster = EXT2FS_B2C(fs, fs->super->s_first_data_block);
+	while (cluster < num_clusters) {
+		retval = ext2fs_find_first_zero_block_bitmap2(fs->block_map,
+						cluster, last_cluster, &next);
+		if (retval == 0)
+			cluster = next;
+		else {
+			cluster = last_cluster + 1;
+			goto next_bg;
 		}
-		num++;
-		if ((num == fs->super->s_clusters_per_group) ||
-		    (cluster == num_clusters-1)) {
+
+		retval = ext2fs_find_first_set_block_bitmap2(fs->block_map,
+						cluster, last_cluster, &next);
+		if (retval)
+			next = last_cluster + 1;
+		grp_free += next - cluster;
+		tot_free += next - cluster;
+		cluster = next;
+
+		if (cluster > last_cluster) {
+		next_bg:
 			ext2fs_bg_free_blocks_count_set(fs, group, grp_free);
 			ext2fs_group_desc_csum_set(fs, group);
-			num = 0;
 			grp_free = 0;
 			group++;
+			last_cluster = EXT2FS_B2C(fs,
+					ext2fs_group_last_block2(fs, group));
 		}
 	}
 	ext2fs_free_blocks_count_set(fs->super, EXT2FS_C2B(fs, tot_free));
-- 
1.8.5.rc3.362.gdf10213



* [PATCH 11/12] Add support for new compat feature "sparse_super2"
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
                   ` (9 preceding siblings ...)
  2014-01-20  5:54 ` [PATCH 10/12] mke2fs: optimize fix_cluster_bg_counts() Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-01-20 21:49   ` Andreas Dilger
       [not found]   ` <alpine.DEB.2.10.1402051559390.16807@tglase.lan.tarent.de>
  2014-01-20  5:54 ` [PATCH 12/12] mke2fs: allow metadata blocks to be at the beginning of the file system Theodore Ts'o
  2014-01-20 16:30 ` [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
  12 siblings, 2 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Theodore Ts'o

In practice, it is **extremely** rare for users to try to use more
than the first backup superblock, which is located at the beginning of
block group #1 (i.e., at block number 32768 for file systems with a 4k
block size).  This new compat feature restricts the backup superblocks
to block group #1 and the last block group in the file system.

Aside from reducing the overhead of the file system by a small number
of blocks, by eliminating the rest of the backup superblocks, it
allows us to have a much more flexible metadata layout.  For example,
we can force all of the allocation bitmaps and inode table blocks to
the beginning of the disk, which allows most of the disk to be
exclusively used for contiguous data blocks.

This simplifies taking advantage of certain HDD-specific features,
such as Shingled Magnetic Recording (aka Shingled Drives), and the
TCG's OPAL Storage Specification, where a simple mapping between
LBA block ranges and the data blocks used by the file system makes
life much simpler.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
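The placement rule can be summarized with a small stand-alone sketch.
The function names and the flattened parameters are hypothetical; they
only mirror the decision order of ext2fs_bg_has_super() and the mke2fs
default selection in the diff below, and are not the library API.
With sparse_super2, only the (at most two) groups recorded in the
superblock carry a backup; without it, the existing sparse_super rule
(groups 0 and 1, plus powers of 3, 5 and 7) applies.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if a is b^n for some n >= 1, else 0. */
static int is_power_of(uint32_t a, uint32_t b)
{
	uint64_t p = b;

	while (p < a)
		p *= b;
	return p == a;
}

/* Sketch of the backup-superblock test; group 0 holds the primary. */
static int sketch_bg_has_super(uint32_t group, int sparse_super2,
			       int sparse_super, const uint32_t bgs[2])
{
	if (group == 0)
		return 1;
	if (sparse_super2)
		return group == bgs[0] || group == bgs[1];
	if (group <= 1 || !sparse_super)
		return 1;
	if (!(group & 1))
		return 0;
	return is_power_of(group, 3) || is_power_of(group, 5) ||
	       is_power_of(group, 7);
}

/* Sketch of the default choice: group 1, then the last group. */
static void sketch_pick_backup_bgs(uint32_t group_count, int num_backups,
				   uint32_t bgs[2])
{
	bgs[0] = bgs[1] = 0;	/* 0 means "no backup in this slot" */
	if (group_count > 1 && num_backups >= 1)
		bgs[0] = 1;
	if (group_count > 2 && num_backups >= 2)
		bgs[1] = group_count - 1;
}
```

A value of 0 in either slot is treated as "unused", which works
because group 0 always holds the primary superblock.
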
 debugfs/set_fields.c        |   2 +
 lib/e2p/feature.c           |   2 +
 lib/e2p/ls.c                |   8 ++
 lib/ext2fs/closefs.c        |  12 ++-
 lib/ext2fs/ext2_fs.h        |   4 +-
 lib/ext2fs/ext2fs.h         |   3 +-
 lib/ext2fs/initialize.c     |   2 +
 lib/ext2fs/res_gdt.c        |  13 ++++
 lib/ext2fs/swapfs.c         |   2 +
 lib/ext2fs/tst_super_size.c |   3 +-
 misc/ext4.5.in              |  11 +++
 misc/mke2fs.8.in            |   6 ++
 misc/mke2fs.c               |  30 +++++++-
 misc/mke2fs.conf.5.in       |   5 ++
 resize/online.c             |   8 ++
 resize/resize2fs.c          | 182 +++++++++++++++++++++++++++++++++++++++++++-
 16 files changed, 284 insertions(+), 9 deletions(-)

diff --git a/debugfs/set_fields.c b/debugfs/set_fields.c
index 9c3b000..ffbda74 100644
--- a/debugfs/set_fields.c
+++ b/debugfs/set_fields.c
@@ -150,6 +150,8 @@ static struct field_set_info super_fields[] = {
 	{ "usr_quota_inum", &set_sb.s_usr_quota_inum, NULL, 4, parse_uint },
 	{ "grp_quota_inum", &set_sb.s_grp_quota_inum, NULL, 4, parse_uint },
 	{ "overhead_blocks", &set_sb.s_overhead_blocks, NULL, 4, parse_uint },
+	{ "backup_bgs", &set_sb.s_backup_bgs[0], NULL, 4, parse_uint,
+	  FLAG_ARRAY, 2 },
 	{ "checksum", &set_sb.s_checksum, NULL, 4, parse_uint },
 	{ 0, 0, 0, 0 }
 };
diff --git a/lib/e2p/feature.c b/lib/e2p/feature.c
index 9691263..1d3e689 100644
--- a/lib/e2p/feature.c
+++ b/lib/e2p/feature.c
@@ -43,6 +43,8 @@ static struct feature feature_list[] = {
 			"lazy_bg" },
 	{	E2P_FEATURE_COMPAT, EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP,
 			"snapshot_bitmap" },
+	{	E2P_FEATURE_COMPAT, EXT4_FEATURE_COMPAT_SPARSE_SUPER2,
+			"sparse_super2" },
 
 	{	E2P_FEATURE_RO_INCOMPAT, EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER,
 			"sparse_super" },
diff --git a/lib/e2p/ls.c b/lib/e2p/ls.c
index 5b3d3c8..6f741c0 100644
--- a/lib/e2p/ls.c
+++ b/lib/e2p/ls.c
@@ -368,6 +368,14 @@ void list_super2(struct ext2_super_block * sb, FILE *f)
 			fprintf(f, "type %u\n", sb->s_jnl_backup_type);
 		}
 	}
+	if (sb->s_backup_bgs[0] || sb->s_backup_bgs[1]) {
+		fprintf(f, "Backup block groups:      ");
+		if (sb->s_backup_bgs[0])
+			fprintf(f, "%u ", sb->s_backup_bgs[0]);
+		if (sb->s_backup_bgs[1])
+			fprintf(f, "%u ", sb->s_backup_bgs[1]);
+		fputc('\n', f);
+	}
 	if (sb->s_snapshot_inum) {
 		fprintf(f, "Snapshot inode:           %u\n",
 			sb->s_snapshot_inum);
diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c
index 3e4af7f..4e91778 100644
--- a/lib/ext2fs/closefs.c
+++ b/lib/ext2fs/closefs.c
@@ -35,8 +35,16 @@ static int test_root(unsigned int a, unsigned int b)
 
 int ext2fs_bg_has_super(ext2_filsys fs, dgrp_t group)
 {
-	if (!(fs->super->s_feature_ro_compat &
-	      EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER) || group <= 1)
+	if (group == 0)
+		return 1;
+	if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2) {
+		if (group == fs->super->s_backup_bgs[0] ||
+		    group == fs->super->s_backup_bgs[1])
+			return 1;
+		return 0;
+	}
+	if ((group <= 1) || !(fs->super->s_feature_ro_compat &
+			      EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER))
 		return 1;
 	if (!(group & 1))
 		return 0;
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index 930c2a3..d9e14d7 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -645,7 +645,8 @@ struct ext2_super_block {
 	__u32	s_usr_quota_inum;	/* inode number of user quota file */
 	__u32	s_grp_quota_inum;	/* inode number of group quota file */
 	__u32	s_overhead_blocks;	/* overhead blocks/clusters in fs */
-	__u32   s_reserved[108];        /* Padding to the end of the block */
+	__u32	s_backup_bgs[2];	/* If sparse_super2 enabled */
+	__u32   s_reserved[106];        /* Padding to the end of the block */
 	__u32	s_checksum;		/* crc32c(superblock) */
 };
 
@@ -696,6 +697,7 @@ struct ext2_super_block {
 #define EXT2_FEATURE_COMPAT_LAZY_BG		0x0040
 /* #define EXT2_FEATURE_COMPAT_EXCLUDE_INODE	0x0080 not used, legacy */
 #define EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP	0x0100
+#define EXT4_FEATURE_COMPAT_SPARSE_SUPER2	0x0200
 
 
 #define EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index 47340dd..efe0964 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -550,7 +550,8 @@ typedef struct ext2_icount *ext2_icount_t;
 					 EXT3_FEATURE_COMPAT_HAS_JOURNAL|\
 					 EXT2_FEATURE_COMPAT_RESIZE_INODE|\
 					 EXT2_FEATURE_COMPAT_DIR_INDEX|\
-					 EXT2_FEATURE_COMPAT_EXT_ATTR)
+					 EXT2_FEATURE_COMPAT_EXT_ATTR|\
+					 EXT4_FEATURE_COMPAT_SPARSE_SUPER2)
 
 /* This #ifdef is temporary until compression is fully supported */
 #ifdef ENABLE_COMPRESSION
diff --git a/lib/ext2fs/initialize.c b/lib/ext2fs/initialize.c
index 2db8b3c..dc6c419 100644
--- a/lib/ext2fs/initialize.c
+++ b/lib/ext2fs/initialize.c
@@ -173,6 +173,8 @@ errcode_t ext2fs_initialize(const char *name, int flags,
 	set_field(s_raid_stripe_width, 0);	/* default stripe width: 0 */
 	set_field(s_log_groups_per_flex, 0);
 	set_field(s_flags, 0);
+	assign_field(s_backup_bgs[0]);
+	assign_field(s_backup_bgs[1]);
 	if (super->s_feature_incompat & ~EXT2_LIB_FEATURE_INCOMPAT_SUPP) {
 		retval = EXT2_ET_UNSUPP_FEATURE;
 		goto cleanup;
diff --git a/lib/ext2fs/res_gdt.c b/lib/ext2fs/res_gdt.c
index 6449228..e61c330 100644
--- a/lib/ext2fs/res_gdt.c
+++ b/lib/ext2fs/res_gdt.c
@@ -31,6 +31,19 @@ static unsigned int list_backups(ext2_filsys fs, unsigned int *three,
 	int mult = 3;
 	unsigned int ret;
 
+	if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2) {
+		if (*min == 1) {
+			*min += 1;
+			if (fs->super->s_backup_bgs[0])
+				return fs->super->s_backup_bgs[0];
+		}
+		if (*min == 2) {
+			*min += 1;
+			if (fs->super->s_backup_bgs[1])
+				return fs->super->s_backup_bgs[1];
+		}
+		return fs->group_desc_count;
+	}
 	if (!(fs->super->s_feature_ro_compat &
 	      EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER)) {
 		ret = *min;
diff --git a/lib/ext2fs/swapfs.c b/lib/ext2fs/swapfs.c
index 56c66cc..2a7b768 100644
--- a/lib/ext2fs/swapfs.c
+++ b/lib/ext2fs/swapfs.c
@@ -99,6 +99,8 @@ void ext2fs_swap_super(struct ext2_super_block * sb)
 	}
 	for (; i < 17; i++)
 		sb->s_jnl_blocks[i] = ext2fs_swab32(sb->s_jnl_blocks[i]);
+	sb->s_backup_bgs[0] = ext2fs_swab32(sb->s_backup_bgs[0]);
+	sb->s_backup_bgs[1] = ext2fs_swab32(sb->s_backup_bgs[1]);
 }
 
 void ext2fs_swap_group_desc2(ext2_filsys fs, struct ext2_group_desc *gdp)
diff --git a/lib/ext2fs/tst_super_size.c b/lib/ext2fs/tst_super_size.c
index 85d87e1..f9cec8a 100644
--- a/lib/ext2fs/tst_super_size.c
+++ b/lib/ext2fs/tst_super_size.c
@@ -135,7 +135,8 @@ int main(int argc, char **argv)
 	check_field(s_usr_quota_inum, 4);
 	check_field(s_grp_quota_inum, 4);
 	check_field(s_overhead_blocks, 4);
-	check_field(s_reserved, 108 * 4);
+	check_field(s_backup_bgs, 8);
+	check_field(s_reserved, 106 * 4);
 	check_field(s_checksum, 4);
 	do_field("Superblock end", 0, 0, cur_offset, 1024);
 #endif
diff --git a/misc/ext4.5.in b/misc/ext4.5.in
index fab1139..5ec39f5 100644
--- a/misc/ext4.5.in
+++ b/misc/ext4.5.in
@@ -171,6 +171,17 @@ kernels from mounting file systems that they could not understand.
 .\" .br
 .\" .B Future feature, available in e2fsprogs 1.43-WIP
 .TP
+.B sparse_super2
+.br
+This feature indicates that there will be at most two backup
+superblocks and block group descriptors.  The block groups used to store
+the backup superblock and block group descriptors are recorded in the
+superblock, but typically, one will be located at the beginning of block
+group #1, and one in the last block group in the file system.  This
+feature is essentially a more extreme version of sparse_super and is
+designed to allow a much larger percentage of the disk to have
+contiguous blocks available for data files.
+.TP
 .B meta_bg
 .br
 This ext4 feature allows file systems to be resized on-line without explicitly
diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
index 67ddbf8..483fb1c 100644
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -274,6 +274,12 @@ small risk if the system crashes before the journal has been overwritten
 entirely one time.  If the option value is omitted, it defaults to 1 to
 enable lazy journal inode zeroing.
 .TP
+.BI num_backup_sb= <0|1|2>
+If the
+.B sparse_super2
+file system feature is enabled this option controls whether there will
+be 0, 1, or 2 backup superblocks created in the file system.
+.TP
 .BI root_owner [=uid:gid]
 Specify the numeric user and group ID of the root directory.  If no UID:GID
 is specified, use the user and group ID of the user running \fBmke2fs\fR.
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 7daa87e..efb068a 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -88,6 +88,7 @@ static int	discard = 1;	/* attempt to discard device before fs creation */
 static int	direct_io;
 static int	force;
 static int	noaction;
+static int	num_backups = 2; /* number of backup bg's for sparse_super2 */
 static uid_t	root_uid;
 static gid_t	root_gid;
 int	journal_size;
@@ -738,6 +739,21 @@ static void parse_extended_opts(struct ext2_super_block *param,
 				r_usage++;
 				continue;
 			}
+		} else if (strcmp(token, "num_backup_sb") == 0) {
+			if (!arg) {
+				r_usage++;
+				badopt = token;
+				continue;
+			}
+			num_backups = strtoul(arg, &p, 0);
+			if (*p || num_backups > 2) {
+				fprintf(stderr,
+					_("Invalid # of backup "
+					  "superblocks: %s\n"),
+					arg);
+				r_usage++;
+				continue;
+			}
 		} else if (strcmp(token, "stride") == 0) {
 			if (!arg) {
 				r_usage++;
@@ -894,6 +910,7 @@ static void parse_extended_opts(struct ext2_super_block *param,
 			"\tis set off by an equals ('=') sign.\n\n"
 			"Valid extended options are:\n"
 			"\tmmp_update_interval=<interval>\n"
+			"\tnum_backup_sb=<0|1|2>\n"
 			"\tstride=<RAID per-disk data chunk in blocks>\n"
 			"\tstripe-width=<RAID stride * data disks in blocks>\n"
 			"\toffset=<offset to create the file system>\n"
@@ -924,7 +941,8 @@ static __u32 ok_features[3] = {
 	EXT3_FEATURE_COMPAT_HAS_JOURNAL |
 		EXT2_FEATURE_COMPAT_RESIZE_INODE |
 		EXT2_FEATURE_COMPAT_DIR_INDEX |
-		EXT2_FEATURE_COMPAT_EXT_ATTR,
+		EXT2_FEATURE_COMPAT_EXT_ATTR |
+		EXT4_FEATURE_COMPAT_SPARSE_SUPER2,
 	/* Incompat */
 	EXT2_FEATURE_INCOMPAT_FILETYPE|
 		EXT3_FEATURE_INCOMPAT_EXTENTS|
@@ -1974,6 +1992,8 @@ profile_error:
 	}
 #endif
 
+	num_backups = get_int_from_profile(fs_types, "num_backup_sb", 2);
+
 	blocksize = EXT2_BLOCK_SIZE(&fs_param);
 
 	/*
@@ -2593,8 +2613,14 @@ int main (int argc, char *argv[])
 		read_bb_file(fs, &bb_list, bad_blocks_filename);
 	if (cflag)
 		test_disk(fs, &bb_list);
-
 	handle_bad_blocks(fs, bb_list);
+
+	if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2) {
+		if (fs->group_desc_count > 1 && num_backups >= 1)
+			fs->super->s_backup_bgs[0] = 1;
+		if (fs->group_desc_count > 2 && num_backups >= 2)
+			fs->super->s_backup_bgs[1] = fs->group_desc_count - 1;
+	}
 	fs->stride = fs_stride = fs->super->s_raid_stride;
 	if (!quiet)
 		printf("%s", _("Allocating group tables: "));
diff --git a/misc/mke2fs.conf.5.in b/misc/mke2fs.conf.5.in
index 0625d0e..43bb91e 100644
--- a/misc/mke2fs.conf.5.in
+++ b/misc/mke2fs.conf.5.in
@@ -357,6 +357,11 @@ initialization noticeably, but it requires the kernel to finish
 initializing the filesystem in the background when the filesystem is
 first mounted.
 .TP
+.I num_backup_sb
+This relation indicates whether file systems with the
+.B sparse_super2
+feature enabled should be created with 0, 1, or 2 backup superblocks.
+.TP
 .I inode_ratio
 This relation specifies the default inode ratio if the user does not
 specify one on the command line.
diff --git a/resize/online.c b/resize/online.c
index defcac1..46d86b0 100644
--- a/resize/online.c
+++ b/resize/online.c
@@ -76,6 +76,14 @@ errcode_t online_resize_fs(ext2_filsys fs, const char *mtpt,
 			no_resize_ioctl = 1;
 	}
 
+	if (EXT2_HAS_COMPAT_FEATURE(fs->super,
+				    EXT4_FEATURE_COMPAT_SPARSE_SUPER2) &&
+	    (access("/sys/fs/ext4/features/sparse_super2", R_OK) != 0)) {
+		com_err(program_name, 0, _("kernel does not support online "
+					   "resize with sparse_super2"));
+		exit(1);
+	}
+
 	printf(_("Filesystem at %s is mounted on %s; "
 		 "on-line resizing required\n"), fs->device_name, mtpt);
 
diff --git a/resize/resize2fs.c b/resize/resize2fs.c
index c4c2517..b0c4b5e 100644
--- a/resize/resize2fs.c
+++ b/resize/resize2fs.c
@@ -53,6 +53,9 @@ static errcode_t ext2fs_calculate_summary_stats(ext2_filsys fs);
 static errcode_t fix_sb_journal_backup(ext2_filsys fs);
 static errcode_t mark_table_blocks(ext2_filsys fs,
 				   ext2fs_block_bitmap bmap);
+static errcode_t clear_sparse_super2_last_group(ext2_resize_t rfs);
+static errcode_t reserve_sparse_super2_last_group(ext2_resize_t rfs,
+						 ext2fs_block_bitmap meta_bmap);
 
 /*
  * Some helper CPP macros
@@ -191,6 +194,10 @@ errcode_t resize_fs(ext2_filsys fs, blk64_t *new_size, int flags,
 		goto errout;
 	print_resource_track(rfs, &rtrack, fs->io);
 
+	retval = clear_sparse_super2_last_group(rfs);
+	if (retval)
+		goto errout;
+
 	rfs->new_fs->super->s_state &= ~EXT2_ERROR_FS;
 	rfs->new_fs->flags &= ~EXT2_FLAG_MASTER_SB_ONLY;
 
@@ -460,6 +467,33 @@ retry:
 	}
 
 	/*
+	 * Update the location of the backup superblocks if the
+	 * sparse_super2 feature is enabled.
+	 */
+	if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2) {
+		dgrp_t last_bg = fs->group_desc_count - 1;
+		dgrp_t old_last_bg = old_fs->group_desc_count - 1;
+
+		if (last_bg > old_last_bg) {
+			if (old_fs->group_desc_count == 1)
+				fs->super->s_backup_bgs[0] = 1;
+			if (old_fs->group_desc_count == 1 &&
+			    fs->super->s_backup_bgs[0])
+				fs->super->s_backup_bgs[0] = last_bg;
+			else if (fs->super->s_backup_bgs[1])
+				fs->super->s_backup_bgs[1] = last_bg;
+		} else if (last_bg < old_last_bg) {
+			if (fs->super->s_backup_bgs[0] > last_bg)
+				fs->super->s_backup_bgs[0] = 0;
+			if (fs->super->s_backup_bgs[1] > last_bg)
+				fs->super->s_backup_bgs[1] = 0;
+			if (last_bg > 1 &&
+			    old_fs->super->s_backup_bgs[1] == old_last_bg)
+				fs->super->s_backup_bgs[1] = last_bg;
+		}
+	}
+
+	/*
 	 * If we are shrinking the number of block groups, we're done
 	 * and can exit now.
 	 */
@@ -615,14 +649,13 @@ errout:
  */
 static errcode_t adjust_superblock(ext2_resize_t rfs, blk64_t new_size)
 {
-	ext2_filsys fs;
+	ext2_filsys	fs = rfs->new_fs;
 	int		adj = 0;
 	errcode_t	retval;
 	blk64_t		group_block;
 	unsigned long	i;
 	unsigned long	max_group;
 
-	fs = rfs->new_fs;
 	ext2fs_mark_super_dirty(fs);
 	ext2fs_mark_bb_dirty(fs);
 	ext2fs_mark_ib_dirty(fs);
@@ -952,6 +985,10 @@ static errcode_t blocks_to_move(ext2_resize_t rfs)
 		new_blocks = fs->desc_blocks + fs->super->s_reserved_gdt_blocks;
 	}
 
+	retval = reserve_sparse_super2_last_group(rfs, meta_bmap);
+	if (retval)
+		goto errout;
+
 	if (old_blocks == new_blocks) {
 		retval = 0;
 		goto errout;
@@ -1840,6 +1877,147 @@ errout:
 }
 
 /*
+ * This function is used when expanding a file system.  It frees the
+ * superblock and block group descriptor blocks from the block group
+ * which is no longer the last block group.
+ */
+static errcode_t clear_sparse_super2_last_group(ext2_resize_t rfs)
+{
+	ext2_filsys	fs = rfs->new_fs;
+	ext2_filsys	old_fs = rfs->old_fs;
+	errcode_t	retval;
+	dgrp_t		old_last_bg = rfs->old_fs->group_desc_count - 1;
+	dgrp_t		last_bg = fs->group_desc_count - 1;
+	blk64_t		sb, old_desc;
+	blk_t		num;
+
+	if (!(fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2))
+		return 0;
+
+	if (last_bg <= old_last_bg)
+		return 0;
+
+	if (fs->super->s_backup_bgs[0] == old_fs->super->s_backup_bgs[0] &&
+	    fs->super->s_backup_bgs[1] == old_fs->super->s_backup_bgs[1])
+		return 0;
+
+	if (old_fs->super->s_backup_bgs[0] != old_last_bg &&
+	    old_fs->super->s_backup_bgs[1] != old_last_bg)
+		return 0;
+
+	if (fs->super->s_backup_bgs[0] == old_last_bg ||
+	    fs->super->s_backup_bgs[1] == old_last_bg)
+		return 0;
+
+	retval = ext2fs_super_and_bgd_loc2(rfs->old_fs, old_last_bg,
+					   &sb, &old_desc, NULL, &num);
+	if (retval)
+		return retval;
+
+	if (sb)
+		ext2fs_unmark_block_bitmap2(fs->block_map, sb);
+	if (old_desc)
+		ext2fs_unmark_block_bitmap_range2(fs->block_map, old_desc, num);
+	return 0;
+}
+
+/*
+ * This function is used when shrinking a file system.  We need to
+ * utilize blocks from what will be the new last block group for the
+ * backup superblock and block group descriptor blocks.
+ * Unfortunately, those blocks may be used by other files or fs
+ * metadata blocks.  We need to mark them as being in use.
+ */
+static errcode_t reserve_sparse_super2_last_group(ext2_resize_t rfs,
+						 ext2fs_block_bitmap meta_bmap)
+{
+	ext2_filsys	fs = rfs->new_fs;
+	ext2_filsys	old_fs = rfs->old_fs;
+	errcode_t	retval;
+	dgrp_t		old_last_bg = rfs->old_fs->group_desc_count - 1;
+	dgrp_t		last_bg = fs->group_desc_count - 1;
+	dgrp_t		g;
+	blk64_t		blk, sb, old_desc;
+	blk_t		i, num;
+	int		realloc = 0;
+
+	if (!(fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2))
+		return 0;
+
+	if (last_bg >= old_last_bg)
+		return 0;
+
+	if (fs->super->s_backup_bgs[0] == old_fs->super->s_backup_bgs[0] &&
+	    fs->super->s_backup_bgs[1] == old_fs->super->s_backup_bgs[1])
+		return 0;
+
+	if (fs->super->s_backup_bgs[0] != last_bg &&
+	    fs->super->s_backup_bgs[1] != last_bg)
+		return 0;
+
+	if (old_fs->super->s_backup_bgs[0] == last_bg ||
+	    old_fs->super->s_backup_bgs[1] == last_bg)
+		return 0;
+
+	retval = ext2fs_super_and_bgd_loc2(rfs->new_fs, last_bg,
+					   &sb, &old_desc, NULL, &num);
+	if (retval)
+		return retval;
+
+	if (!sb) {
+		fputs(_("Should never happen!  No sb in last sparse_super2 bg?\n"),
+		      stderr);
+		exit(1);
+	}
+	if (old_desc != sb+1) {
+		fputs(_("Should never happen!  Unexpected old_desc in "
+			"sparse_super2 bg?\n"),
+		      stderr);
+		exit(1);
+	}
+	num = (old_desc) ? num : 1;
+
+	/* Reserve the backup blocks */
+	ext2fs_mark_block_bitmap_range2(fs->block_map, sb, num);
+
+	for (g = 0; g < fs->group_desc_count; g++) {
+		blk64_t mb;
+
+		mb = ext2fs_block_bitmap_loc(fs, g);
+		if ((mb >= sb) && (mb < sb + num)) {
+			ext2fs_block_bitmap_loc_set(fs, g, 0);
+			realloc = 1;
+		}
+		mb = ext2fs_inode_bitmap_loc(fs, g);
+		if ((mb >= sb) && (mb < sb + num)) {
+			ext2fs_inode_bitmap_loc_set(fs, g, 0);
+			realloc = 1;
+		}
+		mb = ext2fs_inode_table_loc(fs, g);
+		if ((mb < sb + num) &&
+		    (sb < mb + fs->inode_blocks_per_group)) {
+			ext2fs_inode_table_loc_set(fs, g, 0);
+			realloc = 1;
+		}
+		if (realloc) {
+			retval = ext2fs_allocate_group_table(fs, g, 0);
+			if (retval)
+				return retval;
+		}
+	}
+
+	for (blk = sb, i = 0; i < num; blk++, i++) {
+		if (ext2fs_test_block_bitmap2(old_fs->block_map, blk) &&
+		    !ext2fs_test_block_bitmap2(meta_bmap, blk)) {
+			ext2fs_mark_block_bitmap2(rfs->move_blocks, blk);
+			rfs->needed_blocks++;
+		}
+		ext2fs_mark_block_bitmap2(rfs->reserve_blocks, blk);
+	}
+	return 0;
+}
+
+/*
  * Fix the resize inode
  */
 static errcode_t fix_resize_inode(ext2_filsys fs)
-- 
1.8.5.rc3.362.gdf10213



* [PATCH 12/12] mke2fs: allow metadata blocks to be at the beginning of the file system
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
                   ` (10 preceding siblings ...)
  2014-01-20  5:54 ` [PATCH 11/12] Add support for new compat feature "sparse_super2" Theodore Ts'o
@ 2014-01-20  5:54 ` Theodore Ts'o
  2014-01-20 16:30   ` Theodore Ts'o
  2014-01-20 23:25   ` Andreas Dilger
  2014-01-20 16:30 ` [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
  12 siblings, 2 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20  5:54 UTC (permalink / raw)
  To: Ext4 Developers List; +Cc: Theodore Ts'o

Add the extended options packed_meta_blocks and journal_location_front,
which cause mke2fs to place the metadata blocks at the beginning of
the file system.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
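The resulting layout can be sketched with simple arithmetic.  This is
an idealized illustration, not what packed_allocate_tables() does
internally: the real function searches the block bitmap for free
blocks and therefore steps over anything already in use, whereas this
sketch assumes an unbroken free region starting at first_free.  The
struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-group metadata locations in a packed layout. */
struct packed_layout {
	uint64_t block_bitmap;
	uint64_t inode_bitmap;
	uint64_t inode_table;
};

/*
 * All block bitmaps come first, then all inode bitmaps, then the
 * inode tables, each group's table being itb_per_group blocks long.
 */
static struct packed_layout sketch_packed_loc(uint64_t first_free,
					      uint32_t ngroups,
					      uint32_t itb_per_group,
					      uint32_t g)
{
	struct packed_layout l;

	assert(g < ngroups);
	l.block_bitmap = first_free + g;
	l.inode_bitmap = first_free + ngroups + g;
	l.inode_table = first_free + 2ULL * ngroups +
			(uint64_t)g * itb_per_group;
	return l;
}
```

Everything after the last inode table (and the journal, when it is
placed at the front) is then available as one contiguous data area.
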
 lib/ext2fs/ext2fs.h    |  1 +
 lib/ext2fs/mkjournal.c |  5 +++-
 misc/mke2fs.8.in       | 14 +++++++++++
 misc/mke2fs.c          | 65 +++++++++++++++++++++++++++++++++++++++++++++++++-
 misc/mke2fs.conf.5.in  |  4 ++++
 5 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index efe0964..fed6410 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -204,6 +204,7 @@ typedef struct ext2_file *ext2_file_t;
 #define EXT2_MKJOURNAL_V1_SUPER	0x0000001 /* create V1 superblock (deprecated) */
 #define EXT2_MKJOURNAL_LAZYINIT	0x0000002 /* don't zero journal inode before use*/
 #define EXT2_MKJOURNAL_NO_MNT_CHECK 0x0000004 /* don't check mount status */
+#define EXT2_MKJOURNAL_LOCATION_FRONT 0x0000008 /* journal at the beginning */
 
 struct opaque_ext2_group_desc;
 
diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
index d09c458..aa6d9af 100644
--- a/lib/ext2fs/mkjournal.c
+++ b/lib/ext2fs/mkjournal.c
@@ -360,7 +360,10 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
 		    ext2fs_bg_free_blocks_count(fs, group))
 			group = i;
 
-	es.goal = ext2fs_group_first_block2(fs, group);
+	if (flags & EXT2_MKJOURNAL_LOCATION_FRONT)
+		es.goal = 0;
+	else
+		es.goal = ext2fs_group_first_block2(fs, group);
 	retval = ext2fs_block_iterate3(fs, journal_ino, BLOCK_FLAG_APPEND,
 				       0, mkjournal_proc, &es);
 	if (es.err) {
diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
index 483fb1c..9f8c9f6 100644
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -280,6 +280,20 @@ If the
 file system feature is enabled this option controls whether there will
 be 0, 1, or 2 backup superblocks created in the file system.
 .TP
+.B packed_meta_blocks\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
+Place the allocation bitmaps and the inode table at the beginning of the
+disk.  This option requires the flex_bg file system feature to be
+enabled in order for it to have effect, and also implies the
+extended option
+.IR journal_location_front .
+This option is useful for flash devices that use SLC flash at the beginning of
+the disk.   It also maximizes the range of contiguous data blocks, which
+can be useful for certain specialized use cases, such as supported
+Shingled Drives.
+.TP
+.B journal_location_front\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
+Place the journal at the beginning of the file system.
+.TP
 .BI root_owner [=uid:gid]
 Specify the numeric user and group ID of the root directory.  If no UID:GID
 is specified, use the user and group ID of the user running \fBmke2fs\fR.
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index efb068a..363e96b 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -94,6 +94,7 @@ static gid_t	root_gid;
 int	journal_size;
 int	journal_flags;
 static int	lazy_itable_init;
+static int	packed_meta_blocks;
 static char	*bad_blocks_filename = NULL;
 static __u32	fs_stride;
 static int	quotatype = -1;  /* Initialize both user and group quotas by default */
@@ -310,6 +311,40 @@ _("Warning: the backup superblock/group descriptors at block %u contain\n"
 	ext2fs_badblocks_list_iterate_end(bb_iter);
 }
 
+static errcode_t packed_allocate_tables(ext2_filsys fs)
+{
+	errcode_t	retval;
+	dgrp_t		i;
+	blk64_t		goal = 0;
+
+	for (i = 0; i < fs->group_desc_count; i++) {
+		retval = ext2fs_new_block2(fs, goal, NULL, &goal);
+		if (retval)
+			return retval;
+		ext2fs_block_alloc_stats2(fs, goal, +1);
+		ext2fs_block_bitmap_loc_set(fs, i, goal);
+	}
+	for (i = 0; i < fs->group_desc_count; i++) {
+		retval = ext2fs_new_block2(fs, goal, NULL, &goal);
+		if (retval)
+			return retval;
+		ext2fs_block_alloc_stats2(fs, goal, +1);
+		ext2fs_inode_bitmap_loc_set(fs, i, goal);
+	}
+	for (i = 0; i < fs->group_desc_count; i++) {
+		blk64_t end = ext2fs_blocks_count(fs->super) - 1;
+		retval = ext2fs_get_free_blocks2(fs, goal, end,
+						 fs->inode_blocks_per_group,
+						 fs->block_map, &goal);
+		if (retval)
+			return retval;
+		ext2fs_block_alloc_stats_range(fs, goal,
+					       fs->inode_blocks_per_group, +1);
+		ext2fs_inode_table_loc_set(fs, i, goal);
+	}
+	return 0;
+}
+
 static void write_inode_tables(ext2_filsys fs, int lazy_flag, int itable_zeroed)
 {
 	errcode_t	retval;
@@ -712,6 +747,14 @@ static void parse_extended_opts(struct ext2_super_block *param,
 				continue;
 			}
 			param->s_desc_size = desc_size;
+		} else if (strcmp(token, "journal_location_front") == 0) {
+			unsigned long front = 1;
+			if (arg)
+				front = strtoul(arg, &p, 0);
+			if (front)
+				journal_flags |= EXT2_MKJOURNAL_LOCATION_FRONT;
+			else
+				journal_flags &= ~EXT2_MKJOURNAL_LOCATION_FRONT;
 		} else if (strcmp(token, "offset") == 0) {
 			if (!arg) {
 				r_usage++;
@@ -754,6 +797,15 @@ static void parse_extended_opts(struct ext2_super_block *param,
 				r_usage++;
 				continue;
 			}
+		} else if (strcmp(token, "packed_meta_blocks") == 0) {
+			if (arg)
+				packed_meta_blocks = strtoul(arg, &p, 0);
+			else
+				packed_meta_blocks = 1;
+			if (packed_meta_blocks)
+				journal_flags |= EXT2_MKJOURNAL_LOCATION_FRONT;
+			else
+				journal_flags &= ~EXT2_MKJOURNAL_LOCATION_FRONT;
 		} else if (strcmp(token, "stride") == 0) {
 			if (!arg) {
 				r_usage++;
@@ -915,8 +967,10 @@ static void parse_extended_opts(struct ext2_super_block *param,
 			"\tstripe-width=<RAID stride * data disks in blocks>\n"
 			"\toffset=<offset to create the file system>\n"
 			"\tresize=<resize maximum size in blocks>\n"
+			"\tpacked_meta_blocks=<0 to disable, 1 to enable>\n"
 			"\tlazy_itable_init=<0 to disable, 1 to enable>\n"
 			"\tlazy_journal_init=<0 to disable, 1 to enable>\n"
+			"\tjournal_location_front=<0 to disable, 1 to enable>\n"
 			"\troot_uid=<uid of root directory>\n"
 			"\troot_gid=<gid of root directory>\n"
 			"\ttest_fs\n"
@@ -2030,6 +2084,11 @@ profile_error:
 					       EXT2_MKJOURNAL_LAZYINIT : 0;
 	journal_flags |= EXT2_MKJOURNAL_NO_MNT_CHECK;
 
+	packed_meta_blocks = get_bool_from_profile(fs_types,
+						   "packed_meta_blocks", 0);
+	if (packed_meta_blocks)
+		journal_flags |= EXT2_MKJOURNAL_LOCATION_FRONT;
+
 	/* Get options from profile */
 	for (cpp = fs_types; *cpp; cpp++) {
 		tmp = NULL;
@@ -2624,7 +2683,11 @@ int main (int argc, char *argv[])
 	fs->stride = fs_stride = fs->super->s_raid_stride;
 	if (!quiet)
 		printf("%s", _("Allocating group tables: "));
-	retval = ext2fs_allocate_tables(fs);
+	if ((fs->super->s_feature_incompat & EXT4_FEATURE_INCOMPAT_FLEX_BG) &&
+	    packed_meta_blocks)
+		retval = packed_allocate_tables(fs);
+	else
+		retval = ext2fs_allocate_tables(fs);
 	if (retval) {
 		com_err(program_name, retval, "%s",
 			_("while trying to allocate filesystem tables"));
diff --git a/misc/mke2fs.conf.5.in b/misc/mke2fs.conf.5.in
index 43bb91e..1aba87b 100644
--- a/misc/mke2fs.conf.5.in
+++ b/misc/mke2fs.conf.5.in
@@ -362,6 +362,10 @@ This relation indicates whether file systems with the
 .B sparse_super2
 feature enabled should be created with 0, 1, or 2 backup superblocks.
 .TP
+.I packed_meta_blocks
+This boolean relation specifies whether the allocation bitmaps, inode
+table, and journal should be located at the beginning of the file system.
+.TP
 .I inode_ratio
 This relation specifies the default inode ratio if the user does not
 specify one on the command line.
-- 
1.8.5.rc3.362.gdf10213
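
For readers following packed_allocate_tables() above, here is a rough
sketch of the layout its three loops produce. It is only a model: it
assumes the region starting at first_free is entirely free and
contiguous (the real code asks the allocator for the next free block
each time, so it may skip over blocks already in use), and the struct
and function names are made up for illustration; they are not part of
libext2fs.

```c
#include <assert.h>

/* Hypothetical model of the packed layout; not a libext2fs type. */
struct packed_layout {
	unsigned long long block_bitmaps;  /* first block bitmap block */
	unsigned long long inode_bitmaps;  /* first inode bitmap block */
	unsigned long long inode_tables;   /* first inode table block */
	unsigned long long end;            /* first block after all metadata */
};

/*
 * Lay out all block bitmaps, then all inode bitmaps, then all inode
 * tables, in the same order as the three loops in the patch.
 */
static struct packed_layout
packed_layout_compute(unsigned long long first_free,
		      unsigned long long groups,
		      unsigned long long inode_blocks_per_group)
{
	struct packed_layout l;

	l.block_bitmaps = first_free;
	l.inode_bitmaps = l.block_bitmaps + groups;
	l.inode_tables = l.inode_bitmaps + groups;
	l.end = l.inode_tables + groups * inode_blocks_per_group;
	return l;
}

/* Location of group g's inode table under this model. */
static unsigned long long
packed_inode_table_loc(const struct packed_layout *l, unsigned long long g,
		       unsigned long long inode_blocks_per_group)
{
	return l->inode_tables + g * inode_blocks_per_group;
}
```

Under these assumptions everything from l.end onward is left for data
blocks, which is the point of the option.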


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 00/12] e2fsprogs mke2fs optimizations and new features
  2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
                   ` (11 preceding siblings ...)
  2014-01-20  5:54 ` [PATCH 12/12] mke2fs: allow metadata blocks to be at the beginning of the file system Theodore Ts'o
@ 2014-01-20 16:30 ` Theodore Ts'o
  12 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20 16:30 UTC (permalink / raw)
  To: Ext4 Developers List

On Mon, Jan 20, 2014 at 12:54:02AM -0500, Theodore Ts'o wrote:
> Theodore Ts'o (12):
>   libext: optimize find_first_set() for bitarray-based bitmaps
>   libext2: optimize ext2fs_new_block2()

I've fixed up the commit description for these two patches so that
they read "libext2fs".

						- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 12/12] mke2fs: allow metadata blocks to be at the beginning of the file system
  2014-01-20  5:54 ` [PATCH 12/12] mke2fs: allow metadata blocks to be at the beginning of the file system Theodore Ts'o
@ 2014-01-20 16:30   ` Theodore Ts'o
  2014-01-20 23:25   ` Andreas Dilger
  1 sibling, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-20 16:30 UTC (permalink / raw)
  To: Ext4 Developers List

On Mon, Jan 20, 2014 at 12:54:14AM -0500, Theodore Ts'o wrote:
> Add the extended options packed_meta_blocks and journal_location_front
> which cause mke2fs to place the metadata blocks at the beginning of
> the file system.
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>  #define EXT2_MKJOURNAL_LAZYINIT	0x0000002 /* don't zero journal inode before use*/
>  #define EXT2_MKJOURNAL_NO_MNT_CHECK 0x0000004 /* don't check mount status */
> +#define EXT2_MKJOURNAL_LOCATION_FRONT 0x0000004 /* journal at the beginning */

s/0x0000004/0x0000008/
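
To spell out why the value matters: mke2fs unconditionally ORs
EXT2_MKJOURNAL_NO_MNT_CHECK into journal_flags, so if both flags were
0x4 the "front" flag would appear to be set on every run. A small
sketch, using only the three values quoted above with the corrected
bit; journal_at_front() is a made-up helper for illustration, not
something in libext2fs.

```c
#include <assert.h>

/* Values as quoted in this thread, with LOCATION_FRONT corrected to 0x8. */
#define EXT2_MKJOURNAL_LAZYINIT		0x0000002 /* don't zero journal inode before use */
#define EXT2_MKJOURNAL_NO_MNT_CHECK	0x0000004 /* don't check mount status */
#define EXT2_MKJOURNAL_LOCATION_FRONT	0x0000008 /* journal at the beginning */

/* Compile-time checks (C11) that the new flag does not alias an old one. */
_Static_assert((EXT2_MKJOURNAL_NO_MNT_CHECK &
		EXT2_MKJOURNAL_LOCATION_FRONT) == 0,
	       "LOCATION_FRONT must not alias NO_MNT_CHECK");
_Static_assert((EXT2_MKJOURNAL_LAZYINIT &
		EXT2_MKJOURNAL_LOCATION_FRONT) == 0,
	       "LOCATION_FRONT must not alias LAZYINIT");

/* Hypothetical helper: is the "journal at front" bit set? */
static int journal_at_front(int journal_flags)
{
	return (journal_flags & EXT2_MKJOURNAL_LOCATION_FRONT) != 0;
}
```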

					- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 06/12] libext2fs: further clean up and rename check_block_uninit
  2014-01-20  5:54 ` [PATCH 06/12] libext2fs: further clean up and rename check_block_uninit Theodore Ts'o
@ 2014-01-20 20:17   ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2014-01-20 20:17 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

On Mon, Jan 20, 2014 at 12:54:08AM -0500, Theodore Ts'o wrote:
> Commit 8e44eb64bb (libext2fs: mark group data blocks when loading
> block bitmap) simplified check_block_uninit since we are now
> initializing the bitmap when it is loaded from disk.  It left some
> variables which were being set but never used, however.  In addition,
> since we only need check_block_uninit() to clear the block bitmap's
> uninit flag, rename it to clear_block_uninit(), and only call it once
> we have found a free block in ext2fs_new_blocks2().
> 
> This cleans up the code some and optimizes things if we need to search
> multiple block groups trying to find a free block.
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Darrick J. Wong <darrick.wong@oracle.com>

Looks reasonable, so you can:
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>

--D
> ---
>  lib/ext2fs/alloc.c | 33 +++++----------------------------
>  1 file changed, 5 insertions(+), 28 deletions(-)
> 
> diff --git a/lib/ext2fs/alloc.c b/lib/ext2fs/alloc.c
> index 42096a4..91637dd 100644
> --- a/lib/ext2fs/alloc.c
> +++ b/lib/ext2fs/alloc.c
> @@ -27,31 +27,15 @@
>  #include "ext2fs.h"
>  
>  /*
> - * Check for uninit block bitmaps and deal with them appropriately
> + * Clear the uninit block bitmap flag if necessary
>   */
> -static void check_block_uninit(ext2_filsys fs, ext2fs_block_bitmap map,
> -			       dgrp_t group)
> +static void clear_block_uninit(ext2_filsys fs, dgrp_t group)
>  {
> -	blk_t		i;
> -	blk64_t		blk, super_blk, old_desc_blk, new_desc_blk;
> -	int		old_desc_blocks;
> -
>  	if (!(EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
>  					 EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) ||
>  	    !(ext2fs_bg_flags_test(fs, group, EXT2_BG_BLOCK_UNINIT)))
>  		return;
>  
> -	blk = ext2fs_group_first_block2(fs, group);
> -
> -	ext2fs_super_and_bgd_loc2(fs, group, &super_blk,
> -				  &old_desc_blk, &new_desc_blk, 0);
> -
> -	if (fs->super->s_feature_incompat &
> -	    EXT2_FEATURE_INCOMPAT_META_BG)
> -		old_desc_blocks = fs->super->s_first_meta_bg;
> -	else
> -		old_desc_blocks = fs->desc_blocks + fs->super->s_reserved_gdt_blocks;
> -
>  	/* uninit block bitmaps are now initialized in read_bitmaps() */
>  
>  	ext2fs_bg_flags_clear(fs, group, EXT2_BG_BLOCK_UNINIT);
> @@ -78,10 +62,11 @@ static void check_inode_uninit(ext2_filsys fs, ext2fs_inode_bitmap map,
>  		ext2fs_fast_unmark_inode_bitmap2(map, ino);
>  
>  	ext2fs_bg_flags_clear(fs, group, EXT2_BG_INODE_UNINIT);
> +	/* Mimics what the kernel does */
> +	ext2fs_bg_flags_clear(fs, group, EXT2_BG_BLOCK_UNINIT);
>  	ext2fs_group_desc_csum_set(fs, group);
>  	ext2fs_mark_ib_dirty(fs);
>  	ext2fs_mark_super_dirty(fs);
> -	check_block_uninit(fs, fs->block_map, group);
>  }
>  
>  /*
> @@ -167,17 +152,9 @@ errcode_t ext2fs_new_block2(ext2_filsys fs, blk64_t goal,
>  	c_ratio = 1 << ext2fs_get_bitmap_granularity(map);
>  	if (c_ratio > 1)
>  		goal &= ~EXT2FS_CLUSTER_MASK(fs);
> -	check_block_uninit(fs, map,
> -			   (i - fs->super->s_first_data_block) /
> -			   EXT2_BLOCKS_PER_GROUP(fs->super));
>  	do {
> -		if (((i - fs->super->s_first_data_block) %
> -		     EXT2_BLOCKS_PER_GROUP(fs->super)) == 0)
> -			check_block_uninit(fs, map,
> -					   (i - fs->super->s_first_data_block) /
> -					   EXT2_BLOCKS_PER_GROUP(fs->super));
> -
>  		if (!ext2fs_fast_test_block_bitmap2(map, i)) {
> +			clear_block_uninit(fs, ext2fs_group_of_blk2(fs, i));
>  			*ret = i;
>  			return 0;
>  		}
> -- 
> 1.8.5.rc3.362.gdf10213
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 11/12] Add support for new compat feature "sparse_super2"
  2014-01-20  5:54 ` [PATCH 11/12] Add support for new compat feature "sparse_super2" Theodore Ts'o
@ 2014-01-20 21:49   ` Andreas Dilger
       [not found]   ` <alpine.DEB.2.10.1402051559390.16807@tglase.lan.tarent.de>
  1 sibling, 0 replies; 24+ messages in thread
From: Andreas Dilger @ 2014-01-20 21:49 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

[-- Attachment #1: Type: text/plain, Size: 22407 bytes --]

On Jan 19, 2014, at 10:54 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> In practice, it is **extremely** rare for users to try to use more
> than the first backup superblock located at the beginning of block
> group #1.  (i.e., at block number 32768 for file systems with a 4k
> block size).  This new compat feature restricts the backup superblock
> to block group #1 and the last block group in the file system.

This description is no longer completely accurate.  The actual locations
of the backup groups are stored in the superblock.  I'm not totally
against that, but it would mean the backups are harder to find if they
are not in the typical default groups (e.g. #1, #3, #5, etc).  It might
make sense to add that to the man page?

Otherwise, this looks good to me.
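
To make the lookup rule concrete, here is a standalone sketch of the
decision ext2fs_bg_has_super() makes after this patch. It is a model
only: struct sb_model and bg_has_super_model() are invented for
illustration and take plain flags instead of an ext2_filsys, and the
power-of-3/5/7 test is the usual sparse_super rule.

```c
#include <assert.h>

/* Returns 1 if a == b^n for some n >= 1, else 0. */
static int test_root(unsigned int a, unsigned int b)
{
	while (1) {
		if (a < b)
			return 0;
		if (a == b)
			return 1;
		if (a % b)
			return 0;
		a = a / b;
	}
}

/* Hypothetical stand-in for the superblock fields involved. */
struct sb_model {
	int		sparse_super;	/* ro_compat sparse_super set? */
	int		sparse_super2;	/* compat sparse_super2 set? */
	unsigned int	backup_bgs[2];	/* 0 means "slot unused" */
};

static int bg_has_super_model(const struct sb_model *sb, unsigned int group)
{
	if (group == 0)
		return 1;	/* the primary superblock */
	if (sb->sparse_super2)
		/* Only the groups recorded in the superblock hold backups. */
		return group == sb->backup_bgs[0] ||
		       group == sb->backup_bgs[1];
	if (group <= 1 || !sb->sparse_super)
		return 1;
	if (!(group & 1))
		return 0;
	return test_root(group, 3) || test_root(group, 5) ||
	       test_root(group, 7);
}
```

With sparse_super2 set, a recovery tool can no longer guess the backup
groups from the group count alone; it has to read s_backup_bgs[].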

Cheers, Andreas

> Aside from reducing the overhead of the file system by a small number
> of blocks, by eliminating the rest of the backup superblocks, it
> allows us to have a much more flexible metadata layout.  For example,
> we can force all of the allocation bitmaps and inode table blocks to
> the beginning of the disk, which allows most of the disk to be
> exclusively used for contiguous data blocks.
> 
> This simplifies taking advantage of certain HDD specific features,
> such as Shingled Magnetic Recording (aka Shingled Drives), and the
> TCG's OPAL Storage Specification where having a simple mapping between
> LBA block ranges and the data blocks used by the file system can make
> life much simpler.
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
> debugfs/set_fields.c        |   2 +
> lib/e2p/feature.c           |   2 +
> lib/e2p/ls.c                |   8 ++
> lib/ext2fs/closefs.c        |  12 ++-
> lib/ext2fs/ext2_fs.h        |   4 +-
> lib/ext2fs/ext2fs.h         |   3 +-
> lib/ext2fs/initialize.c     |   2 +
> lib/ext2fs/res_gdt.c        |  13 ++++
> lib/ext2fs/swapfs.c         |   2 +
> lib/ext2fs/tst_super_size.c |   3 +-
> misc/ext4.5.in              |  11 +++
> misc/mke2fs.8.in            |   6 ++
> misc/mke2fs.c               |  30 +++++++-
> misc/mke2fs.conf.5.in       |   5 ++
> resize/online.c             |   8 ++
> resize/resize2fs.c          | 182 +++++++++++++++++++++++++++++++++++++++++++-
> 16 files changed, 284 insertions(+), 9 deletions(-)
> 
> diff --git a/debugfs/set_fields.c b/debugfs/set_fields.c
> index 9c3b000..ffbda74 100644
> --- a/debugfs/set_fields.c
> +++ b/debugfs/set_fields.c
> @@ -150,6 +150,8 @@ static struct field_set_info super_fields[] = {
> 	{ "usr_quota_inum", &set_sb.s_usr_quota_inum, NULL, 4, parse_uint },
> 	{ "grp_quota_inum", &set_sb.s_grp_quota_inum, NULL, 4, parse_uint },
> 	{ "overhead_blocks", &set_sb.s_overhead_blocks, NULL, 4, parse_uint },
> +	{ "backup_bgs", &set_sb.s_backup_bgs[0], NULL, 4, parse_uint,
> +	  FLAG_ARRAY, 2 },
> 	{ "checksum", &set_sb.s_checksum, NULL, 4, parse_uint },
> 	{ 0, 0, 0, 0 }
> };
> diff --git a/lib/e2p/feature.c b/lib/e2p/feature.c
> index 9691263..1d3e689 100644
> --- a/lib/e2p/feature.c
> +++ b/lib/e2p/feature.c
> @@ -43,6 +43,8 @@ static struct feature feature_list[] = {
> 			"lazy_bg" },
> 	{	E2P_FEATURE_COMPAT, EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP,
> 			"snapshot_bitmap" },
> +	{	E2P_FEATURE_COMPAT, EXT4_FEATURE_COMPAT_SPARSE_SUPER2,
> +			"sparse_super2" },
> 
> 	{	E2P_FEATURE_RO_INCOMPAT, EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER,
> 			"sparse_super" },
> diff --git a/lib/e2p/ls.c b/lib/e2p/ls.c
> index 5b3d3c8..6f741c0 100644
> --- a/lib/e2p/ls.c
> +++ b/lib/e2p/ls.c
> @@ -368,6 +368,14 @@ void list_super2(struct ext2_super_block * sb, FILE *f)
> 			fprintf(f, "type %u\n", sb->s_jnl_backup_type);
> 		}
> 	}
> +	if (sb->s_backup_bgs[0] || sb->s_backup_bgs[1]) {
> +		fprintf(f, "Backup block groups:      ");
> +		if (sb->s_backup_bgs[0])
> +			fprintf(f, "%u ", sb->s_backup_bgs[0]);
> +		if (sb->s_backup_bgs[1])
> +			fprintf(f, "%u ", sb->s_backup_bgs[1]);
> +		fputc('\n', f);
> +	}
> 	if (sb->s_snapshot_inum) {
> 		fprintf(f, "Snapshot inode:           %u\n",
> 			sb->s_snapshot_inum);
> diff --git a/lib/ext2fs/closefs.c b/lib/ext2fs/closefs.c
> index 3e4af7f..4e91778 100644
> --- a/lib/ext2fs/closefs.c
> +++ b/lib/ext2fs/closefs.c
> @@ -35,8 +35,16 @@ static int test_root(unsigned int a, unsigned int b)
> 
> int ext2fs_bg_has_super(ext2_filsys fs, dgrp_t group)
> {
> -	if (!(fs->super->s_feature_ro_compat &
> -	      EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER) || group <= 1)
> +	if (group == 0)
> +		return 1;
> +	if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2) {
> +		if (group == fs->super->s_backup_bgs[0] ||
> +		    group == fs->super->s_backup_bgs[1])
> +			return 1;
> +		return 0;
> +	}
> +	if ((group <= 1) || !(fs->super->s_feature_ro_compat &
> +			      EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER))
> 		return 1;
> 	if (!(group & 1))
> 		return 0;
> diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
> index 930c2a3..d9e14d7 100644
> --- a/lib/ext2fs/ext2_fs.h
> +++ b/lib/ext2fs/ext2_fs.h
> @@ -645,7 +645,8 @@ struct ext2_super_block {
> 	__u32	s_usr_quota_inum;	/* inode number of user quota file */
> 	__u32	s_grp_quota_inum;	/* inode number of group quota file */
> 	__u32	s_overhead_blocks;	/* overhead blocks/clusters in fs */
> -	__u32   s_reserved[108];        /* Padding to the end of the block */
> +	__u32	s_backup_bgs[2];	/* If sparse_super2 enabled */
> +	__u32   s_reserved[106];        /* Padding to the end of the block */
> 	__u32	s_checksum;		/* crc32c(superblock) */
> };
> 
> @@ -696,6 +697,7 @@ struct ext2_super_block {
> #define EXT2_FEATURE_COMPAT_LAZY_BG		0x0040
> /* #define EXT2_FEATURE_COMPAT_EXCLUDE_INODE	0x0080 not used, legacy */
> #define EXT2_FEATURE_COMPAT_EXCLUDE_BITMAP	0x0100
> +#define EXT4_FEATURE_COMPAT_SPARSE_SUPER2	0x0200
> 
> 
> #define EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
> diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> index 47340dd..efe0964 100644
> --- a/lib/ext2fs/ext2fs.h
> +++ b/lib/ext2fs/ext2fs.h
> @@ -550,7 +550,8 @@ typedef struct ext2_icount *ext2_icount_t;
> 					 EXT3_FEATURE_COMPAT_HAS_JOURNAL|\
> 					 EXT2_FEATURE_COMPAT_RESIZE_INODE|\
> 					 EXT2_FEATURE_COMPAT_DIR_INDEX|\
> -					 EXT2_FEATURE_COMPAT_EXT_ATTR)
> +					 EXT2_FEATURE_COMPAT_EXT_ATTR|\
> +					 EXT4_FEATURE_COMPAT_SPARSE_SUPER2)
> 
> /* This #ifdef is temporary until compression is fully supported */
> #ifdef ENABLE_COMPRESSION
> diff --git a/lib/ext2fs/initialize.c b/lib/ext2fs/initialize.c
> index 2db8b3c..dc6c419 100644
> --- a/lib/ext2fs/initialize.c
> +++ b/lib/ext2fs/initialize.c
> @@ -173,6 +173,8 @@ errcode_t ext2fs_initialize(const char *name, int flags,
> 	set_field(s_raid_stripe_width, 0);	/* default stripe width: 0 */
> 	set_field(s_log_groups_per_flex, 0);
> 	set_field(s_flags, 0);
> +	assign_field(s_backup_bgs[0]);
> +	assign_field(s_backup_bgs[1]);
> 	if (super->s_feature_incompat & ~EXT2_LIB_FEATURE_INCOMPAT_SUPP) {
> 		retval = EXT2_ET_UNSUPP_FEATURE;
> 		goto cleanup;
> diff --git a/lib/ext2fs/res_gdt.c b/lib/ext2fs/res_gdt.c
> index 6449228..e61c330 100644
> --- a/lib/ext2fs/res_gdt.c
> +++ b/lib/ext2fs/res_gdt.c
> @@ -31,6 +31,19 @@ static unsigned int list_backups(ext2_filsys fs, unsigned int *three,
> 	int mult = 3;
> 	unsigned int ret;
> 
> +	if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2) {
> +		if (*min == 1) {
> +			*min += 1;
> +			if (fs->super->s_backup_bgs[0])
> +				return fs->super->s_backup_bgs[0];
> +		}
> +		if (*min == 2) {
> +			*min += 1;
> +			if (fs->super->s_backup_bgs[1])
> +				return fs->super->s_backup_bgs[1];
> +		}
> +		return fs->group_desc_count;
> +	}
> 	if (!(fs->super->s_feature_ro_compat &
> 	      EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER)) {
> 		ret = *min;
> diff --git a/lib/ext2fs/swapfs.c b/lib/ext2fs/swapfs.c
> index 56c66cc..2a7b768 100644
> --- a/lib/ext2fs/swapfs.c
> +++ b/lib/ext2fs/swapfs.c
> @@ -99,6 +99,8 @@ void ext2fs_swap_super(struct ext2_super_block * sb)
> 	}
> 	for (; i < 17; i++)
> 		sb->s_jnl_blocks[i] = ext2fs_swab32(sb->s_jnl_blocks[i]);
> +	sb->s_backup_bgs[0] = ext2fs_swab32(sb->s_backup_bgs[0]);
> +	sb->s_backup_bgs[1] = ext2fs_swab32(sb->s_backup_bgs[1]);
> }
> 
> void ext2fs_swap_group_desc2(ext2_filsys fs, struct ext2_group_desc *gdp)
> diff --git a/lib/ext2fs/tst_super_size.c b/lib/ext2fs/tst_super_size.c
> index 85d87e1..f9cec8a 100644
> --- a/lib/ext2fs/tst_super_size.c
> +++ b/lib/ext2fs/tst_super_size.c
> @@ -135,7 +135,8 @@ int main(int argc, char **argv)
> 	check_field(s_usr_quota_inum, 4);
> 	check_field(s_grp_quota_inum, 4);
> 	check_field(s_overhead_blocks, 4);
> -	check_field(s_reserved, 108 * 4);
> +	check_field(s_backup_bgs, 8);
> +	check_field(s_reserved, 106 * 4);
> 	check_field(s_checksum, 4);
> 	do_field("Superblock end", 0, 0, cur_offset, 1024);
> #endif
> diff --git a/misc/ext4.5.in b/misc/ext4.5.in
> index fab1139..5ec39f5 100644
> --- a/misc/ext4.5.in
> +++ b/misc/ext4.5.in
> @@ -171,6 +171,17 @@ kernels from mounting file systems that they could not understand.
> .\" .br
> .\" .B Future feature, available in e2fsprogs 1.43-WIP
> .TP
> +.B sparse_super2
> +.br
> +This feature indicates that there will be at most two backup
> +superblocks and block group descriptors.  The block groups used to store
> +the backup superblock and blockgroup descriptors are stored in the
> +superblock, but typically, one will be located at the beginning of block
> +group #1, and one in the last block group in the file system.  This
> +feature is essentially a more extreme version of sparse_super and is
> +designed to allow a much larger percentage of the disk to have
> +contiguous blocks available for data files.
> +.TP
> .B meta_bg
> .br
> This ext4 feature allows file systems to be resized on-line without explicitly
> diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
> index 67ddbf8..483fb1c 100644
> --- a/misc/mke2fs.8.in
> +++ b/misc/mke2fs.8.in
> @@ -274,6 +274,12 @@ small risk if the system crashes before the journal has been overwritten
> entirely one time.  If the option value is omitted, it defaults to 1 to
> enable lazy journal inode zeroing.
> .TP
> +.BI num_backup_sb= <0|1|2>
> +If the
> +.B sparse_super2
> +file system feature is enabled this option controls whether there will
> +be 0, 1, or 2 backup superblocks created in the file system.
> +.TP
> .BI root_owner [=uid:gid]
> Specify the numeric user and group ID of the root directory.  If no UID:GID
> is specified, use the user and group ID of the user running \fBmke2fs\fR.
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index 7daa87e..efb068a 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -88,6 +88,7 @@ static int	discard = 1;	/* attempt to discard device before fs creation */
> static int	direct_io;
> static int	force;
> static int	noaction;
> +static int	num_backups = 2; /* number of backup bg's for sparse_super2 */
> static uid_t	root_uid;
> static gid_t	root_gid;
> int	journal_size;
> @@ -738,6 +739,21 @@ static void parse_extended_opts(struct ext2_super_block *param,
> 				r_usage++;
> 				continue;
> 			}
> +		} else if (strcmp(token, "num_backup_sb") == 0) {
> +			if (!arg) {
> +				r_usage++;
> +				badopt = token;
> +				continue;
> +			}
> +			num_backups = strtoul(arg, &p, 0);
> +			if (*p || num_backups > 2) {
> +				fprintf(stderr,
> +					_("Invalid # of backup "
> 					  "superblocks: %s\n"),
> +					arg);
> +				r_usage++;
> +				continue;
> +			}
> 		} else if (strcmp(token, "stride") == 0) {
> 			if (!arg) {
> 				r_usage++;
> @@ -894,6 +910,7 @@ static void parse_extended_opts(struct ext2_super_block *param,
> 			"\tis set off by an equals ('=') sign.\n\n"
> 			"Valid extended options are:\n"
> 			"\tmmp_update_interval=<interval>\n"
> +			"\tnum_backup_sb=<0|1|2>\n"
> 			"\tstride=<RAID per-disk data chunk in blocks>\n"
> 			"\tstripe-width=<RAID stride * data disks in blocks>\n"
> 			"\toffset=<offset to create the file system>\n"
> @@ -924,7 +941,8 @@ static __u32 ok_features[3] = {
> 	EXT3_FEATURE_COMPAT_HAS_JOURNAL |
> 		EXT2_FEATURE_COMPAT_RESIZE_INODE |
> 		EXT2_FEATURE_COMPAT_DIR_INDEX |
> -		EXT2_FEATURE_COMPAT_EXT_ATTR,
> +		EXT2_FEATURE_COMPAT_EXT_ATTR |
> +		EXT4_FEATURE_COMPAT_SPARSE_SUPER2,
> 	/* Incompat */
> 	EXT2_FEATURE_INCOMPAT_FILETYPE|
> 		EXT3_FEATURE_INCOMPAT_EXTENTS|
> @@ -1974,6 +1992,8 @@ profile_error:
> 	}
> #endif
> 
> +	num_backups = get_int_from_profile(fs_types, "num_backup_sb", 2);
> +
> 	blocksize = EXT2_BLOCK_SIZE(&fs_param);
> 
> 	/*
> @@ -2593,8 +2613,14 @@ int main (int argc, char *argv[])
> 		read_bb_file(fs, &bb_list, bad_blocks_filename);
> 	if (cflag)
> 		test_disk(fs, &bb_list);
> -
> 	handle_bad_blocks(fs, bb_list);
> +
> +	if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2) {
> +		if (fs->group_desc_count > 1 && num_backups >= 1)
> +			fs->super->s_backup_bgs[0] = 1;
> +		if (fs->group_desc_count > 2 && num_backups >= 2)
> +			fs->super->s_backup_bgs[1] = fs->group_desc_count - 1;
> +	}
> 	fs->stride = fs_stride = fs->super->s_raid_stride;
> 	if (!quiet)
> 		printf("%s", _("Allocating group tables: "));
> diff --git a/misc/mke2fs.conf.5.in b/misc/mke2fs.conf.5.in
> index 0625d0e..43bb91e 100644
> --- a/misc/mke2fs.conf.5.in
> +++ b/misc/mke2fs.conf.5.in
> @@ -357,6 +357,11 @@ initialization noticeably, but it requires the kernel to finish
> initializing the filesystem in the background when the filesystem is
> first mounted.
> .TP
> +.I num_backup_sb
> +This relation indicates whether file systems with the
> +.B sparse_super2
> +feature enabled should be created with 0, 1, or 2 backup superblocks.
> +.TP
> .I inode_ratio
> This relation specifies the default inode ratio if the user does not
> specify one on the command line.
> diff --git a/resize/online.c b/resize/online.c
> index defcac1..46d86b0 100644
> --- a/resize/online.c
> +++ b/resize/online.c
> @@ -76,6 +76,14 @@ errcode_t online_resize_fs(ext2_filsys fs, const char *mtpt,
> 			no_resize_ioctl = 1;
> 	}
> 
> +	if (EXT2_HAS_COMPAT_FEATURE(fs->super,
> +				    EXT4_FEATURE_COMPAT_SPARSE_SUPER2) &&
> +	    (access("/sys/fs/ext4/features/sparse_super2", R_OK) != 0)) {
> +		com_err(program_name, 0, _("kernel does not support online "
> +					   "resize with sparse_super2"));
> +		exit(1);
> +	}
> +
> 	printf(_("Filesystem at %s is mounted on %s; "
> 		 "on-line resizing required\n"), fs->device_name, mtpt);
> 
> diff --git a/resize/resize2fs.c b/resize/resize2fs.c
> index c4c2517..b0c4b5e 100644
> --- a/resize/resize2fs.c
> +++ b/resize/resize2fs.c
> @@ -53,6 +53,9 @@ static errcode_t ext2fs_calculate_summary_stats(ext2_filsys fs);
> static errcode_t fix_sb_journal_backup(ext2_filsys fs);
> static errcode_t mark_table_blocks(ext2_filsys fs,
> 				   ext2fs_block_bitmap bmap);
> +static errcode_t clear_sparse_super2_last_group(ext2_resize_t rfs);
> +static errcode_t reserve_sparse_super2_last_group(ext2_resize_t rfs,
> +						 ext2fs_block_bitmap meta_bmap);
> 
> /*
>  * Some helper CPP macros
> @@ -191,6 +194,10 @@ errcode_t resize_fs(ext2_filsys fs, blk64_t *new_size, int flags,
> 		goto errout;
> 	print_resource_track(rfs, &rtrack, fs->io);
> 
> +	retval = clear_sparse_super2_last_group(rfs);
> +	if (retval)
> +		goto errout;
> +
> 	rfs->new_fs->super->s_state &= ~EXT2_ERROR_FS;
> 	rfs->new_fs->flags &= ~EXT2_FLAG_MASTER_SB_ONLY;
> 
> @@ -460,6 +467,33 @@ retry:
> 	}
> 
> 	/*
> +	 * Update the location of the backup superblocks if the
> +	 * sparse_super2 feature is enabled.
> +	 */
> +	if (fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2) {
> +		dgrp_t last_bg = fs->group_desc_count - 1;
> +		dgrp_t old_last_bg = old_fs->group_desc_count - 1;
> +
> +		if (last_bg > old_last_bg) {
> +			if (old_fs->group_desc_count == 1)
> +				fs->super->s_backup_bgs[0] = 1;
> +			if (old_fs->group_desc_count == 1 &&
> +			    fs->super->s_backup_bgs[0])
> +				fs->super->s_backup_bgs[0] = last_bg;
> +			else if (fs->super->s_backup_bgs[1])
> +				fs->super->s_backup_bgs[1] = last_bg;
> +		} else if (last_bg < old_last_bg) {
> +			if (fs->super->s_backup_bgs[0] > last_bg)
> +				fs->super->s_backup_bgs[0] = 0;
> +			if (fs->super->s_backup_bgs[1] > last_bg)
> +				fs->super->s_backup_bgs[1] = 0;
> +			if (last_bg > 1 &&
> +			    old_fs->super->s_backup_bgs[1] == old_last_bg)
> +				fs->super->s_backup_bgs[1] = last_bg;
> +		}
> +	}
> +
> +	/*
> 	 * If we are shrinking the number of block groups, we're done
> 	 * and can exit now.
> 	 */
> @@ -615,14 +649,13 @@ errout:
>  */
> static errcode_t adjust_superblock(ext2_resize_t rfs, blk64_t new_size)
> {
> -	ext2_filsys fs;
> +	ext2_filsys	fs = rfs->new_fs;
> 	int		adj = 0;
> 	errcode_t	retval;
> 	blk64_t		group_block;
> 	unsigned long	i;
> 	unsigned long	max_group;
> 
> -	fs = rfs->new_fs;
> 	ext2fs_mark_super_dirty(fs);
> 	ext2fs_mark_bb_dirty(fs);
> 	ext2fs_mark_ib_dirty(fs);
> @@ -952,6 +985,10 @@ static errcode_t blocks_to_move(ext2_resize_t rfs)
> 		new_blocks = fs->desc_blocks + fs->super->s_reserved_gdt_blocks;
> 	}
> 
> +	retval = reserve_sparse_super2_last_group(rfs, meta_bmap);
> +	if (retval)
> +		goto errout;
> +
> 	if (old_blocks == new_blocks) {
> 		retval = 0;
> 		goto errout;
> @@ -1840,6 +1877,147 @@ errout:
> }
> 
> /*
> + * This function is used when expanding a file system.  It frees the
> + * superblock and block group descriptor blocks from the block group
> + * which is no longer the last block group.
> + */
> +static errcode_t clear_sparse_super2_last_group(ext2_resize_t rfs)
> +{
> +	ext2_filsys	fs = rfs->new_fs;
> +	ext2_filsys	old_fs = rfs->old_fs;
> +	errcode_t	retval;
> +	dgrp_t		old_last_bg = rfs->old_fs->group_desc_count - 1;
> +	dgrp_t		last_bg = fs->group_desc_count - 1;
> +	blk64_t		sb, old_desc;
> +	blk_t		num;
> +
> +	if (!(fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2))
> +		return 0;
> +
> +	if (last_bg <= old_last_bg)
> +		return 0;
> +
> +	if (fs->super->s_backup_bgs[0] == old_fs->super->s_backup_bgs[0] &&
> +	    fs->super->s_backup_bgs[1] == old_fs->super->s_backup_bgs[1])
> +		return 0;
> +
> +	if (old_fs->super->s_backup_bgs[0] != old_last_bg &&
> +	    old_fs->super->s_backup_bgs[1] != old_last_bg)
> +		return 0;
> +
> +	if (fs->super->s_backup_bgs[0] == old_last_bg ||
> +	    fs->super->s_backup_bgs[1] == old_last_bg)
> +		return 0;
> +
> +	retval = ext2fs_super_and_bgd_loc2(rfs->old_fs, old_last_bg,
> +					   &sb, &old_desc, NULL, &num);
> +	if (retval)
> +		return retval;
> +
> +	if (sb)
> +		ext2fs_unmark_block_bitmap2(fs->block_map, sb);
> +	if (old_desc)
> +		ext2fs_unmark_block_bitmap_range2(fs->block_map, old_desc, num);
> +	return 0;
> +}
> +
> +/*
> + * This function is used when shrinking a file system.  We need to
> + * utilize blocks from what will be the new last block group for the
> + * backup superblock and block group descriptor blocks.
> + * Unfortunately, those blocks may be used by other files or fs
> + * metadata blocks.  We need to mark them as being in use.
> + */
> +static errcode_t reserve_sparse_super2_last_group(ext2_resize_t rfs,
> +						 ext2fs_block_bitmap meta_bmap)
> +{
> +	ext2_filsys	fs = rfs->new_fs;
> +	ext2_filsys	old_fs = rfs->old_fs;
> +	errcode_t	retval;
> +	dgrp_t		old_last_bg = rfs->old_fs->group_desc_count - 1;
> +	dgrp_t		last_bg = fs->group_desc_count - 1;
> +	dgrp_t		g;
> +	blk64_t		blk, sb, old_desc;
> +	blk_t		i, num;
> +	int		realloc = 0;
> +
> +	if (!(fs->super->s_feature_compat & EXT4_FEATURE_COMPAT_SPARSE_SUPER2))
> +		return 0;
> +
> +	if (last_bg >= old_last_bg)
> +		return 0;
> +
> +	if (fs->super->s_backup_bgs[0] == old_fs->super->s_backup_bgs[0] &&
> +	    fs->super->s_backup_bgs[1] == old_fs->super->s_backup_bgs[1])
> +		return 0;
> +
> +	if (fs->super->s_backup_bgs[0] != last_bg &&
> +	    fs->super->s_backup_bgs[1] != last_bg)
> +		return 0;
> +
> +	if (old_fs->super->s_backup_bgs[0] == last_bg ||
> +	    old_fs->super->s_backup_bgs[1] == last_bg)
> +		return 0;
> +
> +	retval = ext2fs_super_and_bgd_loc2(rfs->new_fs, last_bg,
> +					   &sb, &old_desc, NULL, &num);
> +	if (retval)
> +		return retval;
> +
> +	if (!sb) {
> +		fputs(_("Should never happen!  No sb in last super_sparse bg?\n"),
> +		      stderr);
> +		exit(1);
> +	}
> +	if (old_desc != sb+1) {
> +		fputs(_("Should never happen!  Unexpected old_desc in "
> +			"super_sparse bg?\n"),
> +		      stderr);
> +		exit(1);
> +	}
> +	num = (old_desc) ? num : 1;
> +
> +	/* Reserve the backup blocks */
> +	ext2fs_mark_block_bitmap_range2(fs->block_map, sb, num);
> +
> +	for (g = 0; g < fs->group_desc_count; g++) {
> +		blk64_t mb;
> +
> +		mb = ext2fs_block_bitmap_loc(fs, g);
> +		if ((mb >= sb) && (mb < sb + num)) {
> +			ext2fs_block_bitmap_loc_set(fs, g, 0);
> +			realloc = 1;
> +		}
> +		mb = ext2fs_inode_bitmap_loc(fs, g);
> +		if ((mb >= sb) && (mb < sb + num)) {
> +			ext2fs_inode_bitmap_loc_set(fs, g, 0);
> +			realloc = 1;
> +		}
> +		mb = ext2fs_inode_table_loc(fs, g);
> +		if ((mb < sb + num) &&
> +		    (sb < mb + fs->inode_blocks_per_group)) {
> +			ext2fs_inode_table_loc_set(fs, g, 0);
> +			realloc = 1;
> +		}
> +		if (realloc) {
> +			retval = ext2fs_allocate_group_table(fs, g, 0);
> +			if (retval)
> +				return retval;
> +		}
> +	}
> +
> +	for (blk = sb, i = 0; i < num; blk++, i++) {
> +		if (ext2fs_test_block_bitmap2(old_fs->block_map, blk) &&
> +		    !ext2fs_test_block_bitmap2(meta_bmap, blk)) {
> +			ext2fs_mark_block_bitmap2(rfs->move_blocks, blk);
> +			rfs->needed_blocks++;
> +		}
> +		ext2fs_mark_block_bitmap2(rfs->reserve_blocks, blk);
> +	}
> +	return 0;
> +}
> +
> +/*
>  * Fix the resize inode
>  */
> static errcode_t fix_resize_inode(ext2_filsys fs)
> -- 
> 1.8.5.rc3.362.gdf10213
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 09/12] libext2: optimize ext2fs_new_block2()
  2014-01-20  5:54 ` [PATCH 09/12] libext2: optimize ext2fs_new_block2() Theodore Ts'o
@ 2014-01-20 21:52   ` Andreas Dilger
  2014-01-21 15:54     ` Theodore Ts'o
  0 siblings, 1 reply; 24+ messages in thread
From: Andreas Dilger @ 2014-01-20 21:52 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

[-- Attachment #1: Type: text/plain, Size: 2826 bytes --]

On Jan 19, 2014, at 10:54 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> If there are hundreds of thousands of blocks which are in use before
> the first free block, it is much, MUCH faster to use
> ext2fs_find_first_zero_block_bitmap2() instead of searching the
> allocation bitmap bit by bit.

Excellent.  The libext2fs block allocator has typically been unusable
for filesystems with allocated blocks, so it is good to see this.

This could be further improved by skipping full groups entirely, but
that doesn't detract from the benefits of this patch.
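
For reference, the control flow of the new ext2fs_new_block2() boils
down to a two-pass search: from the goal to the end of the file
system, then from the first data block up to just before the goal. A
minimal sketch follows, assuming a plain one-byte-per-block map;
find_first_zero() here is a naive stand-in and not the optimized
libext2fs search, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Naive stand-in: scan [start, end] inclusive for the first free
 * (zero) entry.  Returns 0 and sets *out on success, -1 if none.
 */
static int find_first_zero(const unsigned char *map, size_t start,
			   size_t end, size_t *out)
{
	size_t i;

	for (i = start; i <= end; i++) {
		if (!map[i]) {
			*out = i;
			return 0;
		}
	}
	return -1;
}

/*
 * Two-pass search, mirroring the shape of the patched function:
 * first [goal, count - 1], then [first, goal - 1].  Assumes count > 0.
 */
static int new_block_model(const unsigned char *map, size_t first,
			   size_t count, size_t goal, size_t *ret)
{
	size_t b;

	if (goal < first || goal >= count)
		goal = first;
	if (find_first_zero(map, goal, count - 1, &b) == 0) {
		*ret = b;
		return 0;
	}
	if (goal != first &&
	    find_first_zero(map, first, goal - 1, &b) == 0) {
		*ret = b;
		return 0;
	}
	return -1;	/* no free block anywhere */
}
```

Skipping full groups, as suggested above, would slot in ahead of the
first pass; it is not shown here.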

Cheers, Andreas

> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
> lib/ext2fs/alloc.c | 34 +++++++++++++++++-----------------
> 1 file changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/lib/ext2fs/alloc.c b/lib/ext2fs/alloc.c
> index 91637dd..7dcfd3c 100644
> --- a/lib/ext2fs/alloc.c
> +++ b/lib/ext2fs/alloc.c
> @@ -137,8 +137,8 @@ errcode_t ext2fs_new_inode(ext2_filsys fs, ext2_ino_t dir,
> errcode_t ext2fs_new_block2(ext2_filsys fs, blk64_t goal,
> 			   ext2fs_block_bitmap map, blk64_t *ret)
> {
> -	blk64_t	i;
> -	int	c_ratio;
> +	errcode_t retval;
> +	blk64_t	b;
> 
> 	EXT2_CHECK_MAGIC(fs, EXT2_ET_MAGIC_EXT2FS_FILSYS);
> 
> @@ -148,21 +148,21 @@ errcode_t ext2fs_new_block2(ext2_filsys fs, blk64_t goal,
> 		return EXT2_ET_NO_BLOCK_BITMAP;
> 	if (!goal || (goal >= ext2fs_blocks_count(fs->super)))
> 		goal = fs->super->s_first_data_block;
> -	i = goal;
> -	c_ratio = 1 << ext2fs_get_bitmap_granularity(map);
> -	if (c_ratio > 1)
> -		goal &= ~EXT2FS_CLUSTER_MASK(fs);
> -	do {
> -		if (!ext2fs_fast_test_block_bitmap2(map, i)) {
> -			clear_block_uninit(fs, ext2fs_group_of_blk2(fs, i));
> -			*ret = i;
> -			return 0;
> -		}
> -		i = (i + c_ratio) & ~(c_ratio - 1);
> -		if (i >= ext2fs_blocks_count(fs->super))
> -			i = fs->super->s_first_data_block;
> -	} while (i != goal);
> -	return EXT2_ET_BLOCK_ALLOC_FAIL;
> +	goal &= ~EXT2FS_CLUSTER_MASK(fs);
> +
> +	retval = ext2fs_find_first_zero_block_bitmap2(map,
> +			goal, ext2fs_blocks_count(fs->super) - 1, &b);
> +	if ((retval == ENOENT) && (goal != fs->super->s_first_data_block))
> +		retval = ext2fs_find_first_zero_block_bitmap2(map,
> +			fs->super->s_first_data_block, goal - 1, &b);
> +	if (retval == ENOENT)
> +		return EXT2_ET_BLOCK_ALLOC_FAIL;
> +	if (retval)
> +		return retval;
> +
> +	clear_block_uninit(fs, ext2fs_group_of_blk2(fs, b));
> +	*ret = b;
> +	return 0;
> }
> 
> errcode_t ext2fs_new_block(ext2_filsys fs, blk_t goal,
> -- 
> 1.8.5.rc3.362.gdf10213
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 12/12] mke2fs: allow metadata blocks to be at the beginning of the file system
  2014-01-20  5:54 ` [PATCH 12/12] mke2fs: allow metadata blocks to be at the beginning of the file system Theodore Ts'o
  2014-01-20 16:30   ` Theodore Ts'o
@ 2014-01-20 23:25   ` Andreas Dilger
  2014-01-21  6:23     ` Theodore Ts'o
  1 sibling, 1 reply; 24+ messages in thread
From: Andreas Dilger @ 2014-01-20 23:25 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

[-- Attachment #1: Type: text/plain, Size: 9483 bytes --]

On Jan 19, 2014, at 10:54 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> Add the extended options packed_meta_blocks and journal_location_front,
> which cause mke2fs to place the metadata blocks at the beginning of
> the file system.

The "packed_meta_blocks" appears to be equivalent to setting flex_bg
to some large enough factor that all the block and inode bitmaps are
at the start of the filesystem?  It would probably be better to align
the inode and block bitmaps and inode table on a multiple of
s_raid_stride (will this be used to align on SMR erase blocks?) so
that rewrites are at least somewhat efficient and aligned?  That would
also allow reserving some room in the flex_bg packing to allow for
filesystem resizing.

It would also be useful to allow setting the journal goal block
directly, instead of journal_location_front only allowing to specify
goal == 0 (i.e. add "-E journal_start_goal=N" instead of adding
"-E journal_location_front", which is implied by packed_meta_blocks).
I've wanted to be able to do this for a long time, but the stumbling
block is that write_journal_inode() doesn't have any parameter to
specify the goal journal block without storing it in the superblock.
I suppose it would be possible to pass the journal goal block in
s_jnl_blocks[0..1] or something?

Cheers, Andreas

> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
> lib/ext2fs/ext2fs.h    |  1 +
> lib/ext2fs/mkjournal.c |  5 +++-
> misc/mke2fs.8.in       | 14 +++++++++++
> misc/mke2fs.c          | 65 +++++++++++++++++++++++++++++++++++++++++++++++++-
> misc/mke2fs.conf.5.in  |  4 ++++
> 5 files changed, 87 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> index efe0964..fed6410 100644
> --- a/lib/ext2fs/ext2fs.h
> +++ b/lib/ext2fs/ext2fs.h
> @@ -204,6 +204,7 @@ typedef struct ext2_file *ext2_file_t;
> #define EXT2_MKJOURNAL_V1_SUPER	0x0000001 /* create V1 superblock (deprecated) */
> #define EXT2_MKJOURNAL_LAZYINIT	0x0000002 /* don't zero journal inode before use*/
> #define EXT2_MKJOURNAL_NO_MNT_CHECK 0x0000004 /* don't check mount status */
> +#define EXT2_MKJOURNAL_LOCATION_FRONT 0x0000008 /* journal at the beginning */
> 
> struct opaque_ext2_group_desc;
> 
> diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
> index d09c458..aa6d9af 100644
> --- a/lib/ext2fs/mkjournal.c
> +++ b/lib/ext2fs/mkjournal.c
> @@ -360,7 +360,10 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
> 		    ext2fs_bg_free_blocks_count(fs, group))
> 			group = i;
> 
> -	es.goal = ext2fs_group_first_block2(fs, group);
> +	if (flags & EXT2_MKJOURNAL_LOCATION_FRONT)
> +		es.goal = 0;
> +	else
> +		es.goal = ext2fs_group_first_block2(fs, group);
> 	retval = ext2fs_block_iterate3(fs, journal_ino, BLOCK_FLAG_APPEND,
> 				       0, mkjournal_proc, &es);
> 	if (es.err) {
> diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
> index 483fb1c..9f8c9f6 100644
> --- a/misc/mke2fs.8.in
> +++ b/misc/mke2fs.8.in
> @@ -280,6 +280,20 @@ If the
> file system feature is enabled this option controls whether there will
> be 0, 1, or 2 backup superblocks created in the file system.
> .TP
> +.B packed_meta_blocks\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
> +Place the allocation bitmaps and the inode table at the beginning of the
> +disk.  This option requires the flex_bg file system feature to be
> +enabled in order for it to have effect, and also implies the
> +extended option
> +.IR journal_location_front .
> +This option is useful for flash devices that use SLC flash at the beginning of
> +the disk.   It also maximizes the range of contiguous data blocks, which
> +can be useful for certain specialized use cases, such as supporting
> +Shingled Drives.
> +.TP
> +.B journal_location_front\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
> +Place the journal at the beginning of the file system.
> +.TP
> .BI root_owner [=uid:gid]
> Specify the numeric user and group ID of the root directory.  If no UID:GID
> is specified, use the user and group ID of the user running \fBmke2fs\fR.
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index efb068a..363e96b 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -94,6 +94,7 @@ static gid_t	root_gid;
> int	journal_size;
> int	journal_flags;
> static int	lazy_itable_init;
> +static int	packed_meta_blocks;
> static char	*bad_blocks_filename = NULL;
> static __u32	fs_stride;
> static int	quotatype = -1;  /* Initialize both user and group quotas by default */
> @@ -310,6 +311,40 @@ _("Warning: the backup superblock/group descriptors at block %u contain\n"
> 	ext2fs_badblocks_list_iterate_end(bb_iter);
> }
> 
> +static errcode_t packed_allocate_tables(ext2_filsys fs)
> +{
> +	errcode_t	retval;
> +	dgrp_t		i;
> +	blk64_t		goal = 0;
> +
> +	for (i = 0; i < fs->group_desc_count; i++) {
> +		retval = ext2fs_new_block2(fs, goal, NULL, &goal);
> +		if (retval)
> +			return retval;
> +		ext2fs_block_alloc_stats2(fs, goal, +1);
> +		ext2fs_block_bitmap_loc_set(fs, i, goal);
> +	}
> +	for (i = 0; i < fs->group_desc_count; i++) {
> +		retval = ext2fs_new_block2(fs, goal, NULL, &goal);
> +		if (retval)
> +			return retval;
> +		ext2fs_block_alloc_stats2(fs, goal, +1);
> +		ext2fs_inode_bitmap_loc_set(fs, i, goal);
> +	}
> +	for (i = 0; i < fs->group_desc_count; i++) {
> +		blk64_t end = ext2fs_blocks_count(fs->super) - 1;
> +		retval = ext2fs_get_free_blocks2(fs, goal, end,
> +						 fs->inode_blocks_per_group,
> +						 fs->block_map, &goal);
> +		if (retval)
> +			return retval;
> +		ext2fs_block_alloc_stats_range(fs, goal,
> +					       fs->inode_blocks_per_group, +1);
> +		ext2fs_inode_table_loc_set(fs, i, goal);
> +	}
> +	return 0;
> +}
> +
> static void write_inode_tables(ext2_filsys fs, int lazy_flag, int itable_zeroed)
> {
> 	errcode_t	retval;
> @@ -712,6 +747,14 @@ static void parse_extended_opts(struct ext2_super_block *param,
> 				continue;
> 			}
> 			param->s_desc_size = desc_size;
> +		} else if (strcmp(token, "journal_location_front") == 0) {
> +			unsigned long front = 1;
> +			if (arg)
> +				front = strtoul(arg, &p, 0);
> +			if (front)
> +				journal_flags |= EXT2_MKJOURNAL_LOCATION_FRONT;
> +			else
> +				journal_flags &= ~EXT2_MKJOURNAL_LOCATION_FRONT;
> 		} else if (strcmp(token, "offset") == 0) {
> 			if (!arg) {
> 				r_usage++;
> @@ -754,6 +797,15 @@ static void parse_extended_opts(struct ext2_super_block *param,
> 				r_usage++;
> 				continue;
> 			}
> +		} else if (strcmp(token, "packed_meta_blocks") == 0) {
> +			if (arg)
> +				packed_meta_blocks = strtoul(arg, &p, 0);
> +			else
> +				packed_meta_blocks = 1;
> +			if (packed_meta_blocks)
> +				journal_flags |= EXT2_MKJOURNAL_LOCATION_FRONT;
> +			else
> +				journal_flags &= ~EXT2_MKJOURNAL_LOCATION_FRONT;
> 		} else if (strcmp(token, "stride") == 0) {
> 			if (!arg) {
> 				r_usage++;
> @@ -915,8 +967,10 @@ static void parse_extended_opts(struct ext2_super_block *param,
> 			"\tstripe-width=<RAID stride * data disks in blocks>\n"
> 			"\toffset=<offset to create the file system>\n"
> 			"\tresize=<resize maximum size in blocks>\n"
> +			"\tpacked_meta_blocks=<0 to disable, 1 to enable>\n"
> 			"\tlazy_itable_init=<0 to disable, 1 to enable>\n"
> 			"\tlazy_journal_init=<0 to disable, 1 to enable>\n"
> +			"\tjournal_location_front=<0 to disable, 1 to enable>\n"
> 			"\troot_uid=<uid of root directory>\n"
> 			"\troot_gid=<gid of root directory>\n"
> 			"\ttest_fs\n"
> @@ -2030,6 +2084,11 @@ profile_error:
> 					       EXT2_MKJOURNAL_LAZYINIT : 0;
> 	journal_flags |= EXT2_MKJOURNAL_NO_MNT_CHECK;
> 
> +	packed_meta_blocks = get_bool_from_profile(fs_types,
> +						   "packed_meta_blocks", 0);
> +	if (packed_meta_blocks)
> +		journal_flags |= EXT2_MKJOURNAL_LOCATION_FRONT;
> +
> 	/* Get options from profile */
> 	for (cpp = fs_types; *cpp; cpp++) {
> 		tmp = NULL;
> @@ -2624,7 +2683,11 @@ int main (int argc, char *argv[])
> 	fs->stride = fs_stride = fs->super->s_raid_stride;
> 	if (!quiet)
> 		printf("%s", _("Allocating group tables: "));
> -	retval = ext2fs_allocate_tables(fs);
> +	if ((fs->super->s_feature_incompat & EXT4_FEATURE_INCOMPAT_FLEX_BG) &&
> +	    packed_meta_blocks)
> +		retval = packed_allocate_tables(fs);
> +	else
> +		retval = ext2fs_allocate_tables(fs);
> 	if (retval) {
> 		com_err(program_name, retval, "%s",
> 			_("while trying to allocate filesystem tables"));
> diff --git a/misc/mke2fs.conf.5.in b/misc/mke2fs.conf.5.in
> index 43bb91e..1aba87b 100644
> --- a/misc/mke2fs.conf.5.in
> +++ b/misc/mke2fs.conf.5.in
> @@ -362,6 +362,10 @@ This relation indicates whether file systems with the
> .B sparse_super2
> feature enabled should be created with 0, 1, or 2 backup superblocks.
> .TP
> +.I packed_meta_blocks
> +This boolean relation specifies whether the allocation bitmaps, inode
> +table, and journal should be located at the beginning of the file system.
> +.TP
> .I inode_ratio
> This relation specifies the default inode ratio if the user does not
> specify one on the command line.
> -- 
> 1.8.5.rc3.362.gdf10213
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 12/12] mke2fs: allow metadata blocks to be at the beginning of the file system
  2014-01-20 23:25   ` Andreas Dilger
@ 2014-01-21  6:23     ` Theodore Ts'o
  2014-01-23 21:28       ` Andreas Dilger
  0 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-21  6:23 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Ext4 Developers List

On Mon, Jan 20, 2014 at 04:25:23PM -0700, Andreas Dilger wrote:
> The "packed_meta_blocks" appears to be equivalent to setting flex_bg
> to some large enough factor that all the block and inode bitmaps are
> at the start of the filesystem?

It's not the same thing, unfortunately, because of how we make room
for the file system to grow and so require extra room for potential
new block groups.  Try running "mke2fs -t ext4 -G 262144 /tmp/foo.img
1T" and look at the gaps between the allocation bitmaps and the inode
table using dumpe2fs.

> It would probably be better to align
> the inode and block bitmaps and inode table on a multiple of
> s_raid_stride (will this be used to align on SMR erase blocks?) so
> that rewrites are at least somewhat efficient and aligned?  That would
> also allow reserving some room in the flex_bg packing to allow for
> filesystem resizing.

Given these blocks are written using random 4k writes, I don't think
any kind of alignment is going to be worth it.

> It would also be useful to allow setting the journal goal block
> directly, instead of journal_location_front only allowing to specify
> goal == 0 (i.e. add "-E journal_start_goal=N" instead of adding
> "-E journal_location_front", which is implied by packed_meta_blocks).
> I've wanted to be able to do this for a long time, but the stumbling
> block is that write_journal_inode() doesn't have any parameter to
> specify the goal journal block without storing it in the superblock.
> I suppose it would be possible to pass the journal goal block in
> s_jnl_blocks[0..1] or something?

Hmm, yes, adding a flag which indicates that the starting block should
be passed in s_jnl_blocks[0] is a good idea.

					- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 09/12] libext2: optimize ext2fs_new_block2()
  2014-01-20 21:52   ` Andreas Dilger
@ 2014-01-21 15:54     ` Theodore Ts'o
  0 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-01-21 15:54 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Ext4 Developers List

On Mon, Jan 20, 2014 at 02:52:43PM -0700, Andreas Dilger wrote:
> On Jan 19, 2014, at 10:54 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> > If there are hundreds of thousands of blocks which are in use before
> > the first free block, it is much, MUCH faster to use
> > ext2fs_find_first_zero_block_bitmap2() instead of searching the
> > allocation bitmap bit by bit.
> 
> Excellent.  The libext2fs block allocator has typically been unusable
> for filesystems with allocated blocks, so it is good to see this.

Yes, all of these patches should address the question which you raised
in the following e-mail message a while back:

------------
Date: Fri, 28 May 2010 13:25:45 -0600
From: Andreas Dilger <andreas.dilger@oracle.com>
To: "linux-ext4@vger.kernel.org development" <linux-ext4@vger.kernel.org>
Cc: Zhiqi Tao <zhiqi.tao@oracle.com>
Subject: Huge flex_bg count kills mke2fs
X-Mailer: Apple Mail (2.1078)

We're trying to put all of the static ext4 metadata at the beginning
of the disk to see whether this gives us a performance improvement
(avoid seeking during e2fsck, avoid free space fragmentation).  This
is on a filesystem somewhat smaller than 16TB, and the flex_bg count
131072 would seem large enough to put all of the groups into a single
flex group.

However, running the below command spins forever, apparently trying to
allocate the static metadata:

mke2fs -j -b 4096 -G 131072 -J size=1024 -i 65536 \
    -O flex_bg,uninit_bg,extents -I 256 -F /dev/vgost0/lvost0 4294965248
...
-----------

(I found it today while I was doing some mail searching.)

> This could be further improved by skipping full groups entirely, but
> that doesn't detract from the benefits of this patch.

Actually, given that we're using a red-black tree for the block
bitmaps these days, it's actually much more efficient than checking
the block counts to see which block groups might be free.  If the
first 4,000 block groups have been fully allocated, there will be a
single entry node in the red-black tree.  :-)
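
The point about the red-black tree can be illustrated with a toy extent
list.  The sketch below stores allocated blocks as sorted,
non-overlapping (start, count) extents in a plain array instead of a
red-black tree (an assumption made to keep it short), and
toy_extent_first_zero() is a hypothetical helper, not a libext2fs
function.  Finding the first free block costs one step per extent, no
matter how many blocks each extent covers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* One run of allocated blocks: [start, start + count). */
struct toy_extent {
	uint64_t start;
	uint64_t count;
};

/*
 * Find the first free block in [start, end] (inclusive), given a sorted
 * array of non-overlapping allocated extents.
 * Returns 0 and sets *out, or ENOENT if the whole range is allocated.
 */
static int toy_extent_first_zero(const struct toy_extent *ext, size_t n,
				 uint64_t start, uint64_t end, uint64_t *out)
{
	uint64_t b = start;
	size_t i;

	assert(start <= end);
	for (i = 0; i < n; i++) {
		if (ext[i].start + ext[i].count <= b)
			continue;	/* extent ends before the cursor */
		if (ext[i].start > b)
			break;		/* gap in front of this extent */
		/* Cursor is inside the extent: jump past all of it at once. */
		b = ext[i].start + ext[i].count;
		if (b > end)
			return ENOENT;
	}
	*out = b;
	return 0;
}
```

With 4,000 fully allocated groups of 32768 blocks each stored as the
single extent (0, 131072000), the search reaches block 131072000 after
examining just that one extent.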

						- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 12/12] mke2fs: allow metadata blocks to be at the beginning of the file system
  2014-01-21  6:23     ` Theodore Ts'o
@ 2014-01-23 21:28       ` Andreas Dilger
  0 siblings, 0 replies; 24+ messages in thread
From: Andreas Dilger @ 2014-01-23 21:28 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

[-- Attachment #1: Type: text/plain, Size: 5002 bytes --]

On Jan 20, 2014, at 11:23 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Jan 20, 2014 at 04:25:23PM -0700, Andreas Dilger wrote:
>> The "packed_meta_blocks" appears to be equivalent to setting flex_bg
>> to some large enough factor that all the block and inode bitmaps are
>> at the start of the filesystem?
> 
> It's not the same thing, unfortunately, because of how we make room
> for the file system to grow and so require extra room for potential
> new block groups.  Try running "mke2fs -t ext4 -G 262144 /tmp/foo.img
> 1T" and look at the gaps between the allocation bitmaps and the inode
> table using dumpe2fs.

I'd consider that a potential benefit, since it would allow the
filesystem to be resized at a later point.  However, it does mean
that if there is a big difference between the original filesystem
size and the maximum size that the intervening blocks would need
to be reserved blocks so they do not get allocated in the meantime
(e.g. allocated by the resize inode or something?).

I actually tested with "mke2fs -t ext4 -G 131072 -i 1048576"
(16TB max size, 1MB per inode, mke2fs 1.42.7.wc1) to see what the
on-disk layout is like.  It seems a bug in inode table allocation
gives a bad layout.  It starts out as expected, planning for 4 full
groups of block bitmaps (4 groups * 32K blocks/group = 131072 bitmaps)
from group #0 to #3, then 4 full groups of inode bitmaps from group
#4 to #7, and finally inode tables starting at group #8 onward (but
note I don't think it takes backup group descriptors into account):

  Group 0: (Blocks 0-32767)
    Primary superblock at 0, Group descriptors at 1-2048
    Block bitmap 2049 (+2049), Inode bitmap at 133121 (bg #4+2049)
    Inode table 264193-264200 (bg #8+2049)
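
The offsets in that listing follow from simple arithmetic, sketched
below under the assumptions stated in the text: 4 KiB blocks, 32768
blocks per group, a flex_bg size of 131072, and the first bitmap placed
at block 2049, right after the group descriptors.  The macro and
function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCKS_PER_GROUP 32768ULL	/* one 4 KiB bitmap block covers 32768 blocks */
#define FLEXBG_SIZE      131072ULL	/* groups per flex group (-G 131072) */
#define FIRST_META_BLOCK 2049ULL	/* superblock at 0, descriptors at 1-2048 */

/* Number of whole block groups needed to hold one bitmap block per group. */
static uint64_t groups_per_bitmap_area(void)
{
	return FLEXBG_SIZE / BLOCKS_PER_GROUP;
}

/* Planned location of group g's block bitmap. */
static uint64_t planned_block_bitmap(uint64_t g)
{
	assert(g < FLEXBG_SIZE);
	return FIRST_META_BLOCK + g;
}

/* Planned location of group g's inode bitmap: shifted by one bitmap area. */
static uint64_t planned_inode_bitmap(uint64_t g)
{
	assert(g < FLEXBG_SIZE);
	return groups_per_bitmap_area() * BLOCKS_PER_GROUP +
	       FIRST_META_BLOCK + g;
}

/* Planned start of the inode table area: shifted by two bitmap areas. */
static uint64_t planned_inode_table_start(void)
{
	return 2 * groups_per_bitmap_area() * BLOCKS_PER_GROUP +
	       FIRST_META_BLOCK;
}
```

For group 0 this gives 2049, 133121 and 264193, matching the dumpe2fs
output above.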

However, things go badly when the inode tables fill up their first
block group.  For some reason, the inode table allocations wrap
around to group #0 instead of continuing in group #9, which screws
up the block bitmap allocations and they start getting interleaved
with the inode table:

  Group 3838: (Blocks 125763584-125796351) [INODE_UNINIT, BLOCK_UNINIT]
    Block bitmap 5887 (bg #0+5887), Inode bitmap 136959 (bg #4+5887)
    Inode table 294897-294904 (bg #8 + 32753)
  Group 3839: (Blocks 125796352-125829119) [INODE_UNINIT, BLOCK_UNINIT]
    Block bitmap 5888 (bg #0+5888), Inode bitmap 136960 (bg #4+5888)
    Inode table 5889-5896 (bg #0 + 5889)
  Group 3840: (Blocks 125829120-125861887) [INODE_UNINIT, BLOCK_UNINIT]
    Block bitmap 5897 (bg #0+5897), Inode bitmap 136961 (bg #4+5889)
    Inode table 5898-5905 (bg #0 + 5898)

This eventually screws up all of the flex_bg allocations and it is
not really much better than non-flex_bg for the rest of the filesystem.
It looks like the problem for the group #3839 inode table is because
it runs into the backup superblock and group descriptors in group #9:

  Group 9: (Blocks 294912-327679) [INODE_UNINIT]
    Backup superblock at 294912, Group descriptors at 294913-296960
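
The collision can be checked by hand.  Assuming each inode table
occupies 8 blocks (as in the listings above) and the tables are laid
out back to back from block 264193, the sketch below finds the first
group whose table would touch the backup superblock at block 294912.
The constants are taken from the dumpe2fs output quoted above; the
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ITABLE_START  264193ULL	/* first inode table block (group 0) */
#define ITABLE_BLOCKS 8ULL	/* inode table blocks per group */
#define BACKUP_SB     294912ULL	/* first block of group 9 (9 * 32768) */

/* First block of group g's inode table in a back-to-back layout. */
static uint64_t itable_first_block(uint64_t group)
{
	return ITABLE_START + group * ITABLE_BLOCKS;
}

/* First group whose inode table reaches or crosses the reserved block. */
static uint64_t first_group_hitting(uint64_t reserved)
{
	uint64_t g = 0;

	assert(reserved >= ITABLE_START);
	while (itable_first_block(g) + ITABLE_BLOCKS - 1 < reserved)
		g++;
	return g;
}
```

Group 3838's table ends at block 294904, while group 3839's would span
294905-294912 and so touch the backup superblock, which is exactly
where the listing shows the allocation wrapping around to group #0.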

I tried digging through the code to see where it went wrong, and it
looks like the ext2fs_allocate_group_table()->flexbg_offset() code is
failing when it doesn't find an empty range, and then resets the
start block to the first group in the flex_bg (== 0):

        if (start_blk && ext2fs_test_block_bitmap_range2(bmap, start_blk,
                                                         elem_size))
                return start_blk;

        start_blk = ext2fs_group_first_block2(fs, flexbg_size * flexbg);

It would be interesting to test this in conjunction with sparse_super2
and put the first backup group descriptor in group #49 (after the inode
table) and see if it can get an ideal flex_bg layout.

I might have a patch that can fix this without too much effort, not
sure yet.

Cheers, Andreas

>> It would probably be better to align
>> the inode and block bitmaps and inode table on a multiple of
>> s_raid_stride (will this be used to align on SMR erase blocks?) so
>> that rewrites are at least somewhat efficient and aligned?  That would
>> also allow reserving some room in the flex_bg packing to allow for
>> filesystem resizing.
> 
> Given these blocks are written using random 4k writes, I don't think
> any kind of alignment is going to be worth it.
> 
>> It would also be useful to allow setting the journal goal block
>> directly, instead of journal_location_front only allowing to specify
>> goal == 0 (i.e. add "-E journal_start_goal=N" instead of adding
>> "-E journal_location_front", which is implied by packed_meta_blocks).
>> I've wanted to be able to do this for a long time, but the stumbling
>> block is that write_journal_inode() doesn't have any parameter to
>> specify the goal journal block without storing it in the superblock.
>> I suppose it would be possible to pass the journal goal block in
>> s_jnl_blocks[0..1] or something?
> 
> Hmm, yes, adding a flag which indicates that the starting block should
> be passed in s_jnl_blocks[0] is a good idea.
> 
> 					- Ted

[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 11/12] Add support for new compat feature "sparse_super2"
       [not found]   ` <alpine.DEB.2.10.1402051559390.16807@tglase.lan.tarent.de>
@ 2014-02-05 16:17     ` Theodore Ts'o
  0 siblings, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-02-05 16:17 UTC (permalink / raw)
  To: Thorsten Glaser

On Wed, Feb 05, 2014 at 04:01:26PM +0100, Thorsten Glaser wrote:
> 
> Can we have this for tune2fs? At least a way to turn it on, so that
> the next e2fsck or resize2fs will actually get rid of the now no
> longer needed copies?

Sure, it's possible.  It's not like there are *that* many copies so it
won't actually save that much disk space.  The main goal was not to save
space, since the amount of disk space for the extra copies is a very
tiny fraction of the disk space in a big file system.

The main motivation for sparse superblocks was to allow me to make
file systems where the data block area is perfectly contiguous, which
is useful for various specialized use cases.  For example, you might
be a huge cloud company which is absolutely fanatical about
controlling 99.9 percentile latency, and which wants to create data
block files which are completely contiguous.  The sparse_super2
feature will also be useful for supporting Shingled Magnetic Recording
(SMR) disks where we care about having large, contiguous files be
aligned on 256 MB SMR zones, in the case where the userspace
application is SMR-aware.  (SMR drives are only now starting to be
available in prototype form from HDD vendors, and the T10
specifications haven't been finalized yet, so these are still very
early days for SMR support.)

> What kernels support this option? (Sorry if this was already said…
> I mostly noticed this in today’s apt-listchanges output only, and
> DuckDuckGo found virtually nothing, and Google was only marginally
> more helpful.) Is this ext4fs only?

The sparse_super2 feature is a compat feature, so it doesn't have any
kernel dependencies, and no, it's not ext4 only.  Any modern kernel
will have no problem supporting it.  Some old 2.2 or 2.4 kernels with
ext3 might complain because blocks which they think should be reserved
for backup superblocks will not be, but that hasn't been true for
quite some time.
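
The "no kernel dependencies" point follows from how the three
ext2/3/4 feature classes are treated at mount time.  As a rough sketch
(the function, enum, struct, and bit values below are made up for
illustration and are not the kernel's code): unknown compat bits are
ignored, unknown ro_compat bits allow only a read-only mount, and
unknown incompat bits cause the mount to be refused.

```c
#include <assert.h>
#include <stdint.h>

enum toy_mount { TOY_MOUNT_RW, TOY_MOUNT_RO_ONLY, TOY_MOUNT_REFUSED };

/* Feature bitmasks, either as found on disk or as known to a kernel. */
struct toy_features {
	uint32_t compat;
	uint32_t ro_compat;
	uint32_t incompat;
};

/* Decide how a kernel knowing 'known' may mount a file system with 'fs'. */
static enum toy_mount toy_mount_decision(const struct toy_features *fs,
					 const struct toy_features *known)
{
	assert(fs && known);
	if (fs->incompat & ~known->incompat)
		return TOY_MOUNT_REFUSED;	/* unknown incompat feature */
	if (fs->ro_compat & ~known->ro_compat)
		return TOY_MOUNT_RO_ONLY;	/* unknown ro_compat feature */
	/* Unknown compat bits are deliberately not checked. */
	return TOY_MOUNT_RW;
}
```

Under this model a kernel that has never heard of a compat bit still
mounts the file system read-write, which is the behaviour described
above for sparse_super2.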

One restriction is that kernels currently do not support online resize
of sparse_super2 file systems.  That will come eventually, but the
intended use case for this feature was primarily for single disk file
systems, so online resize wasn't something that is high on the
priority list.

Cheers,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/12] libext2fs: add ext2fs_block_alloc_stats_range()
  2014-01-20  5:54 ` [PATCH 07/12] libext2fs: add ext2fs_block_alloc_stats_range() Theodore Ts'o
@ 2014-02-13 21:50   ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2014-02-13 21:50 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Ext4 Developers List

On Mon, Jan 20, 2014 at 12:54:09AM -0500, Theodore Ts'o wrote:
> This function is more efficient than using ext2fs_block_alloc_stats2()
> for each block in a range.  The efficiencies come from being able to
> set a block range in the block bitmap at once, and from being able to
> update the block group descriptors once per block group.  Especially
> now that we are checksumming the block group descriptors, and we are
> using red black trees for the allocation bitmaps, these changes can
> make a huge difference in the CPU time used by mke2fs when creating
> very large file systems.
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  lib/ext2fs/alloc_stats.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  lib/ext2fs/ext2fs.h      |  2 ++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/lib/ext2fs/alloc_stats.c b/lib/ext2fs/alloc_stats.c
> index adec363..7919c09 100644
> --- a/lib/ext2fs/alloc_stats.c
> +++ b/lib/ext2fs/alloc_stats.c
> @@ -106,3 +106,44 @@ void ext2fs_set_block_alloc_stats_callback(ext2_filsys fs,
>  
>  	fs->block_alloc_stats = func;
>  }
> +
> +void ext2fs_block_alloc_stats_range(ext2_filsys fs, blk64_t blk,
> +				    blk_t num, int inuse)
> +{
> +#ifndef OMIT_COM_ERR
> +	if (blk + num >= ext2fs_blocks_count(fs->super)) {
> +		com_err("ext2fs_block_alloc_stats_range", 0,
> +			"Illegal block range: %llu (%u) ",
> +			(unsigned long long) blk, num);
> +		return;
> +	}
> +#endif
> +	if (inuse == 0)
> +		return;
> +	if (inuse > 0) {
> +		ext2fs_mark_block_bitmap_range2(fs->block_map, blk, num);
> +		inuse = 1;
> +	} else {
> +		ext2fs_unmark_block_bitmap_range2(fs->block_map, blk, num);
> +		inuse = -1;
> +	}
> +	while (num) {
> +		int group = ext2fs_group_of_blk2(fs, blk);
> +		blk64_t last_blk = ext2fs_group_last_block2(fs, group);
> +		blk_t n = num;
> +
> +		if (blk + num > last_blk)
> +			n = last_blk - blk + 1;
> +
> +		ext2fs_bg_free_blocks_count_set(fs, group,
> +			ext2fs_bg_free_blocks_count(fs, group) -
> +			inuse*n/EXT2FS_CLUSTER_RATIO(fs));
> +		ext2fs_bg_flags_clear(fs, group, EXT2_BG_BLOCK_UNINIT);
> +		ext2fs_group_desc_csum_set(fs, group);
> +		ext2fs_free_blocks_count_add(fs->super, -inuse * n);

When I use this function, e2fsck complains:
"Free blocks count wrong (124554823344, counted=771760)."

It looks like "-inuse * n" is a 32-bit quantity that isn't correctly
sign-extended to the blk64_t that ext2fs_free_blocks_count_add() requires.
A trivial fix is to change n to blk64_t to force a conversion.
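
A standalone sketch of the conversion problem, assuming a platform
where int is 32 bits wide.  The toy_* types and functions are
stand-ins invented for this example; only the expression
"-inuse * n" is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t toy_blk_t;	/* stand-in for blk_t (32-bit, unsigned) */
typedef uint64_t toy_blk64_t;	/* stand-in for blk64_t (64-bit, unsigned) */

/* Stand-in for ext2fs_free_blocks_count_add(): takes a 64-bit delta. */
static void toy_free_count_add(toy_blk64_t *count, toy_blk64_t delta)
{
	assert(count);
	*count += delta;	/* unsigned wraparound: 2^64 - n acts as -n */
}

static void toy_update_buggy(toy_blk64_t *count, int inuse, toy_blk_t n)
{
	/*
	 * int * uint32_t is evaluated as uint32_t, so -1 * 10 becomes
	 * 2^32 - 10, which is then zero-extended to 64 bits.
	 */
	toy_free_count_add(count, -inuse * n);
}

static void toy_update_fixed(toy_blk64_t *count, int inuse, toy_blk_t num)
{
	toy_blk64_t n = num;	/* widen first, so the product is 64-bit */

	toy_free_count_add(count, -inuse * n);
}
```

Starting from a count of 1000 with inuse = 1 and n = 10, the buggy
variant adds 4294967286 to the counter, while the widened variant
subtracts 10 as intended.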

Will send a patch soon.

(gcc 4.6.3, Ubuntu 12.04)

--D

> +		blk += n;
> +		num -= n;
> +	}
> +	ext2fs_mark_super_dirty(fs);
> +	ext2fs_mark_bb_dirty(fs);
> +}
> diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> index 1e07f88..47340dd 100644
> --- a/lib/ext2fs/ext2fs.h
> +++ b/lib/ext2fs/ext2fs.h
> @@ -683,6 +683,8 @@ void ext2fs_inode_alloc_stats2(ext2_filsys fs, ext2_ino_t ino,
>  			       int inuse, int isdir);
>  void ext2fs_block_alloc_stats(ext2_filsys fs, blk_t blk, int inuse);
>  void ext2fs_block_alloc_stats2(ext2_filsys fs, blk64_t blk, int inuse);
> +void ext2fs_block_alloc_stats_range(ext2_filsys fs, blk64_t blk,
> +				    blk_t num, int inuse);
>  
>  /* alloc_tables.c */
>  extern errcode_t ext2fs_allocate_tables(ext2_filsys fs);
> -- 
> 1.8.5.rc3.362.gdf10213
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2014-02-13 21:50 UTC | newest]

Thread overview: 24+ messages
2014-01-20  5:54 [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
2014-01-20  5:54 ` [PATCH 01/12] libext2fs: fix off-by-one bug in ext2fs_extent_insert() Theodore Ts'o
2014-01-20  5:54 ` [PATCH 02/12] libext2fs: clean up generic handling of ext2fs_find_first_{set,zero}_*() Theodore Ts'o
2014-01-20  5:54 ` [PATCH 03/12] libext2fs: build tst_bitmaps with rep invariants checking enabled Theodore Ts'o
2014-01-20  5:54 ` [PATCH 04/12] libext: optimize find_first_set() for bitarray-based bitmaps Theodore Ts'o
2014-01-20  5:54 ` [PATCH 05/12] libext2fs: optimize find_first_{zero,set}() for red-black tree based bitmaps Theodore Ts'o
2014-01-20  5:54 ` [PATCH 06/12] libext2fs: further clean up and rename check_block_uninit Theodore Ts'o
2014-01-20 20:17   ` Darrick J. Wong
2014-01-20  5:54 ` [PATCH 07/12] libext2fs: add ext2fs_block_alloc_stats_range() Theodore Ts'o
2014-02-13 21:50   ` Darrick J. Wong
2014-01-20  5:54 ` [PATCH 08/12] libext2fs: optimize ext2fs_allocate_group_table() Theodore Ts'o
2014-01-20  5:54 ` [PATCH 09/12] libext2: optimize ext2fs_new_block2() Theodore Ts'o
2014-01-20 21:52   ` Andreas Dilger
2014-01-21 15:54     ` Theodore Ts'o
2014-01-20  5:54 ` [PATCH 10/12] mke2fs: optimize fix_cluster_bg_counts() Theodore Ts'o
2014-01-20  5:54 ` [PATCH 11/12] Add support for new compat feature "sparse_super2" Theodore Ts'o
2014-01-20 21:49   ` Andreas Dilger
     [not found]   ` <alpine.DEB.2.10.1402051559390.16807@tglase.lan.tarent.de>
2014-02-05 16:17     ` Theodore Ts'o
2014-01-20  5:54 ` [PATCH 12/12] mke2fs: allow metadata blocks to be at the beginning of the file system Theodore Ts'o
2014-01-20 16:30   ` Theodore Ts'o
2014-01-20 23:25   ` Andreas Dilger
2014-01-21  6:23     ` Theodore Ts'o
2014-01-23 21:28       ` Andreas Dilger
2014-01-20 16:30 ` [PATCH 00/12] e2fsprogs mke2fs optimizations and new features Theodore Ts'o
