* [PATCH v2] ext4: fix online resize with mballoc
From: Frédéric Bohé @ 2008-07-01 9:17 UTC
To: linux-ext4
From: Frederic Bohe <frederic.bohe@bull.net>
Update group infos when updating a group's descriptor.
Add group infos when adding a group's descriptor.
Refresh cache pages used by mb_alloc when changes occur.
This will probably need changes once META_BG resizing is allowed.
Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
---
This patch applies on top of 2.6.26-rc6 + ext4-patch-queue-37b3e39765d8521ba2252bfec81bb91504fa35c8
It fixes oopses that occur when a filesystem mounted with mballoc is resized online.
Two oopses have been identified:
The first one occurs when unmounting the resized filesystem.
The second one happens when trying to write to a group added during the online resize.
This patch has been tested with:
- small (100MB) and large (5TB) filesystems.
- filesystems with 1K blocks and with 4K blocks.
- inode sizes from 256 up to 1024.
The tests consisted of:
- online resizing, filling all blocks of the fs, unmounting, fs check
- filling all blocks of the fs, online resizing, filling newly added groups, unmounting, fs check
- Concurrent file copy/online resize, unmounting, fs check.
Non-regression tests:
- offline resizing.
- online resizing without mballoc.
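
For readers following the "refresh cache pages" part of the change: the patch invalidates the two buddy-cache blocks that belong to a group (cache blocks 2 * group and 2 * group + 1) by looking up the page that holds each of them. A minimal sketch of that index arithmetic, in plain user-space C, is given here. The helper name cache_page_index and the explicit page_size / block_size parameters are illustrative assumptions and are not part of the patch, which uses PAGE_CACHE_SIZE and sb->s_blocksize directly.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not in the patch): return the index of the
 * page-cache page that holds a given block of the buddy cache inode.
 * It mirrors the patch's arithmetic:
 *     blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
 *     pnum = block / blocks_per_page;
 */
static unsigned long cache_page_index(unsigned long cache_block,
                                      unsigned long page_size,
                                      unsigned long block_size)
{
    unsigned long blocks_per_page = page_size / block_size;

    return cache_block / blocks_per_page;
}

/* First of the two cache blocks owned by a group (2 * group). */
static unsigned long first_cache_block(unsigned long group)
{
    return group * 2;
}
```

With 4K pages and 1K blocks, both cache blocks of group 5 (blocks 10 and 11) fall in page 2, so the same page is looked up twice. With 4K blocks they fall in pages 10 and 11, which is why the patch computes the page number separately for each block.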
ext4.h | 4 +
mballoc.c | 236 +++++++++++++++++++++++++++++++++++++++++++++++---------------
resize.c | 52 +++++++++++++
3 files changed, 236 insertions(+), 56 deletions(-)
Index: linux-2.6.26-rc6.patch_resize/fs/ext4/ext4.h
===================================================================
--- linux-2.6.26-rc6.patch_resize.orig/fs/ext4/ext4.h 2008-06-30 18:41:26.000000000 +0200
+++ linux-2.6.26-rc6.patch_resize/fs/ext4/ext4.h 2008-06-30 18:47:22.000000000 +0200
@@ -1115,6 +1115,10 @@ extern int __init init_ext4_mballoc(void
extern void exit_ext4_mballoc(void);
extern void ext4_mb_free_blocks(handle_t *, struct inode *,
unsigned long, unsigned long, int, unsigned long *);
+extern int ext4_mb_add_more_groupinfo(struct super_block *sb,
+ ext4_group_t i, struct ext4_group_desc *desc);
+extern void ext4_mb_update_group_info(struct ext4_group_info *grp,
+ ext4_grpblk_t add);
/* inode.c */
Index: linux-2.6.26-rc6.patch_resize/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.26-rc6.patch_resize.orig/fs/ext4/mballoc.c 2008-06-30 18:41:26.000000000 +0200
+++ linux-2.6.26-rc6.patch_resize/fs/ext4/mballoc.c 2008-07-01 10:39:05.000000000 +0200
@@ -2235,21 +2235,192 @@ ext4_mb_store_history(struct ext4_alloca
#define ext4_mb_history_init(sb)
#endif
+
+/* Create and initialize ext4_group_info data for the given group. */
+int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
+ struct ext4_group_desc *desc)
+{
+ int i, len;
+ int metalen = 0;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_group_info **meta_group_info;
+
+ /*
+ * First check if this group is the first of a reserved block.
+ * If it's true, we have to allocate a new table of pointers
+ * to ext4_group_info structures
+ */
+ if (group % EXT4_DESC_PER_BLOCK(sb) == 0) {
+ metalen = sizeof(*meta_group_info) <<
+ EXT4_DESC_PER_BLOCK_BITS(sb);
+ meta_group_info = kmalloc(metalen, GFP_KERNEL);
+ if (meta_group_info == NULL) {
+ printk(KERN_ERR "EXT4-fs: can't allocate mem for a "
+ "buddy group\n");
+ goto exit_meta_group_info;
+ }
+ sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)] =
+ meta_group_info;
+ }
+
+ /*
+ * calculate needed size. if change bb_counters size,
+ * don't forget about ext4_mb_generate_buddy()
+ */
+ len = offsetof(typeof(**meta_group_info),
+ bb_counters[sb->s_blocksize_bits + 2]);
+
+ meta_group_info =
+ sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)];
+ i = group & (EXT4_DESC_PER_BLOCK(sb) - 1);
+
+ meta_group_info[i] = kzalloc(len, GFP_KERNEL);
+ if (meta_group_info[i] == NULL) {
+ printk(KERN_ERR "EXT4-fs: can't allocate buddy mem\n");
+ goto exit_group_info;
+ }
+ set_bit(EXT4_GROUP_INFO_NEED_INIT_BIT,
+ &(meta_group_info[i]->bb_state));
+
+ /*
+ * initialize bb_free to be able to skip
+ * empty groups without initialization
+ */
+ if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+ meta_group_info[i]->bb_free =
+ ext4_free_blocks_after_init(sb, group, desc);
+ } else {
+ meta_group_info[i]->bb_free =
+ le16_to_cpu(desc->bg_free_blocks_count);
+ }
+
+ INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
+
+#ifdef DOUBLE_CHECK
+ {
+ struct buffer_head *bh;
+ meta_group_info[i]->bb_bitmap =
+ kmalloc(sb->s_blocksize, GFP_KERNEL);
+ BUG_ON(meta_group_info[i]->bb_bitmap == NULL);
+ bh = ext4_read_block_bitmap(sb, group);
+ BUG_ON(bh == NULL);
+ memcpy(meta_group_info[i]->bb_bitmap, bh->b_data,
+ sb->s_blocksize);
+ put_bh(bh);
+ }
+#endif
+
+ return 0;
+
+exit_group_info:
+ /* If a meta_group_info table has been allocated, release it now */
+ if (group % EXT4_DESC_PER_BLOCK(sb) == 0)
+ kfree(sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)]);
+exit_meta_group_info:
+ return -ENOMEM;
+} /* ext4_mb_add_groupinfo */
+
+/*
+ * Add a group to the existing groups.
+ * This function is used for online resize
+ */
+int ext4_mb_add_more_groupinfo(struct super_block *sb, ext4_group_t group,
+ struct ext4_group_desc *desc)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct inode *inode = sbi->s_buddy_cache;
+ int blocks_per_page;
+ int block;
+ int pnum;
+ struct page *page;
+ int err;
+
+ /* Add group based on group descriptor*/
+ err = ext4_mb_add_groupinfo(sb, group, desc);
+ if (err)
+ return err;
+
+ /*
+ * Cache pages containing dynamic mb_alloc data (buddy and bitmap
+ * data) are marked not up to date so that they will be re-initialized
+ * during the next call to ext4_mb_load_buddy
+ */
+
+ /* Set buddy page as not up to date */
+ blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
+ block = group * 2;
+ pnum = block / blocks_per_page;
+ page = find_get_page(inode->i_mapping, pnum);
+ if (page != NULL) {
+ ClearPageUptodate(page);
+ page_cache_release(page);
+ }
+
+ /* Set bitmap page as not up to date */
+ block++;
+ pnum = block / blocks_per_page;
+ page = find_get_page(inode->i_mapping, pnum);
+ if (page != NULL) {
+ ClearPageUptodate(page);
+ page_cache_release(page);
+ }
+
+ return 0;
+}
+
+/*
+ * Update an existing group.
+ * This function is used for online resize
+ */
+void ext4_mb_update_group_info(struct ext4_group_info *grp, ext4_grpblk_t add)
+{
+ grp->bb_free += add;
+}
+
static int ext4_mb_init_backend(struct super_block *sb)
{
ext4_group_t i;
- int j, len, metalen;
+ int metalen;
struct ext4_sb_info *sbi = EXT4_SB(sb);
- int num_meta_group_infos =
- (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) - 1) >>
- EXT4_DESC_PER_BLOCK_BITS(sb);
+ struct ext4_super_block *es = sbi->s_es;
+ int num_meta_group_infos;
+ int num_meta_group_infos_max;
+ int array_size;
struct ext4_group_info **meta_group_info;
+ struct ext4_group_desc *desc;
+
+ /* This is the number of blocks used by GDT */
+ num_meta_group_infos = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) -
+ 1) >> EXT4_DESC_PER_BLOCK_BITS(sb);
+
+ /*
+ * This is the total number of blocks used by GDT including
+ * the number of reserved blocks for GDT.
+ * The s_group_info array is allocated with this value
+ * to allow a clean online resize without a complex
+ * manipulation of pointer.
+ * The drawback is the unused memory when no resize
+ * occurs but it's very low in terms of pages
+ * (see comments below)
+ * Need to handle this properly when META_BG resizing is allowed
+ */
+ num_meta_group_infos_max = num_meta_group_infos +
+ le16_to_cpu(es->s_reserved_gdt_blocks);
+ /*
+ * array_size is the size of s_group_info array. We round it
+ * to the next power of two because this approximation is done
+ * internally by kmalloc so we can have some more memory
+ * for free here (e.g. may be used for META_BG resize).
+ */
+ array_size = 1;
+ while (array_size < sizeof(*sbi->s_group_info) *
+ num_meta_group_infos_max)
+ array_size = array_size << 1;
/* An 8TB filesystem with 64-bit pointers requires a 4096 byte
* kmalloc. A 128kb malloc should suffice for a 256TB filesystem.
* So a two level scheme suffices for now. */
- sbi->s_group_info = kmalloc(sizeof(*sbi->s_group_info) *
- num_meta_group_infos, GFP_KERNEL);
+ sbi->s_group_info = kmalloc(array_size, GFP_KERNEL);
if (sbi->s_group_info == NULL) {
printk(KERN_ERR "EXT4-fs: can't allocate buddy meta group\n");
return -ENOMEM;
@@ -2276,62 +2447,15 @@ static int ext4_mb_init_backend(struct s
sbi->s_group_info[i] = meta_group_info;
}
- /*
- * calculate needed size. if change bb_counters size,
- * don't forget about ext4_mb_generate_buddy()
- */
- len = sizeof(struct ext4_group_info);
- len += sizeof(unsigned short) * (sb->s_blocksize_bits + 2);
for (i = 0; i < sbi->s_groups_count; i++) {
- struct ext4_group_desc *desc;
-
- meta_group_info =
- sbi->s_group_info[i >> EXT4_DESC_PER_BLOCK_BITS(sb)];
- j = i & (EXT4_DESC_PER_BLOCK(sb) - 1);
-
- meta_group_info[j] = kzalloc(len, GFP_KERNEL);
- if (meta_group_info[j] == NULL) {
- printk(KERN_ERR "EXT4-fs: can't allocate buddy mem\n");
- goto err_freebuddy;
- }
desc = ext4_get_group_desc(sb, i, NULL);
if (desc == NULL) {
printk(KERN_ERR
"EXT4-fs: can't read descriptor %lu\n", i);
- i++;
goto err_freebuddy;
}
- set_bit(EXT4_GROUP_INFO_NEED_INIT_BIT,
- &(meta_group_info[j]->bb_state));
-
- /*
- * initialize bb_free to be able to skip
- * empty groups without initialization
- */
- if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
- meta_group_info[j]->bb_free =
- ext4_free_blocks_after_init(sb, i, desc);
- } else {
- meta_group_info[j]->bb_free =
- le16_to_cpu(desc->bg_free_blocks_count);
- }
-
- INIT_LIST_HEAD(&meta_group_info[j]->bb_prealloc_list);
-
-#ifdef DOUBLE_CHECK
- {
- struct buffer_head *bh;
- meta_group_info[j]->bb_bitmap =
- kmalloc(sb->s_blocksize, GFP_KERNEL);
- BUG_ON(meta_group_info[j]->bb_bitmap == NULL);
- bh = ext4_read_block_bitmap(sb, i);
- BUG_ON(bh == NULL);
- memcpy(meta_group_info[j]->bb_bitmap, bh->b_data,
- sb->s_blocksize);
- put_bh(bh);
- }
-#endif
-
+ if (ext4_mb_add_groupinfo(sb, i, desc) != 0)
+ goto err_freebuddy;
}
return 0;
Index: linux-2.6.26-rc6.patch_resize/fs/ext4/resize.c
===================================================================
--- linux-2.6.26-rc6.patch_resize.orig/fs/ext4/resize.c 2008-06-30 18:41:26.000000000 +0200
+++ linux-2.6.26-rc6.patch_resize/fs/ext4/resize.c 2008-06-30 18:58:33.000000000 +0200
@@ -866,6 +866,15 @@ int ext4_group_add(struct super_block *s
gdp->bg_checksum = ext4_group_desc_csum(sbi, input->group, gdp);
/*
+ * We can allocate memory for mb_alloc based on the new group
+ * descriptor
+ */
+ if (test_opt(sb, MBALLOC)) {
+ err = ext4_mb_add_more_groupinfo(sb, input->group, gdp);
+ if (err)
+ goto exit_journal;
+ }
+ /*
* Make the new blocks and inodes valid next. We do this before
* increasing the group count so that once the group is enabled,
* all of its blocks and inodes are already valid.
@@ -957,6 +966,8 @@ int ext4_group_extend(struct super_block
handle_t *handle;
int err;
unsigned long freed_blocks;
+ ext4_group_t group;
+ struct ext4_group_info *grp;
/* We don't need to worry about locking wrt other resizers just
* yet: we're going to revalidate es->s_blocks_count after
@@ -988,7 +999,7 @@ int ext4_group_extend(struct super_block
}
/* Handle the remaining blocks in the last group only. */
- ext4_get_group_no_and_offset(sb, o_blocks_count, NULL, &last);
+ ext4_get_group_no_and_offset(sb, o_blocks_count, &group, &last);
if (last == 0) {
ext4_warning(sb, __func__,
@@ -1060,6 +1071,45 @@ int ext4_group_extend(struct super_block
o_blocks_count + add);
if ((err = ext4_journal_stop(handle)))
goto exit_put;
+
+ /*
+ * Mark mballoc pages as not up to date so that they will be updated
+ * next time they are loaded by ext4_mb_load_buddy.
+ */
+ if (test_opt(sb, MBALLOC)) {
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct inode *inode = sbi->s_buddy_cache;
+ int blocks_per_page;
+ int block;
+ int pnum;
+ struct page *page;
+
+ /* Set buddy page as not up to date */
+ blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
+ block = group * 2;
+ pnum = block / blocks_per_page;
+ page = find_get_page(inode->i_mapping, pnum);
+ if (page != NULL) {
+ ClearPageUptodate(page);
+ page_cache_release(page);
+ }
+
+ /* Set bitmap page as not up to date */
+ block++;
+ pnum = block / blocks_per_page;
+ page = find_get_page(inode->i_mapping, pnum);
+ if (page != NULL) {
+ ClearPageUptodate(page);
+ page_cache_release(page);
+ }
+
+ /* Get the info on the last group */
+ grp = ext4_get_group_info(sb, group);
+
+ /* Update free blocks in group info */
+ ext4_mb_update_group_info(grp, add);
+ }
+
if (test_opt(sb, DEBUG))
printk(KERN_DEBUG "EXT4-fs: extended group to %llu blocks\n",
ext4_blocks_count(es));
--
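
As a reading aid for ext4_mb_add_groupinfo() in the patch above: s_group_info is a two-level table, and a group number is split into a first-level table index and a slot inside that table. The sketch below reproduces that split in user-space C. The names group_slot and needs_new_table, and the bare desc_per_block_bits parameter, are assumptions made for illustration; the patch itself uses EXT4_DESC_PER_BLOCK(sb) and EXT4_DESC_PER_BLOCK_BITS(sb).

```c
#include <stddef.h>

/* Hypothetical pair of coordinates inside the two-level s_group_info table. */
struct slot {
    unsigned int table;  /* index into sbi->s_group_info[]         */
    unsigned int index;  /* index inside the second-level table    */
};

/* Mirrors: group >> BITS and group & (DESC_PER_BLOCK - 1). */
static struct slot group_slot(unsigned int group,
                              unsigned int desc_per_block_bits)
{
    struct slot s;

    s.table = group >> desc_per_block_bits;
    s.index = group & ((1u << desc_per_block_bits) - 1);
    return s;
}

/*
 * Mirrors: group % EXT4_DESC_PER_BLOCK(sb) == 0, i.e. the group is the
 * first one of a descriptor block, so a new second-level table is needed.
 */
static int needs_new_table(unsigned int group,
                           unsigned int desc_per_block_bits)
{
    return (group & ((1u << desc_per_block_bits) - 1)) == 0;
}
```

Assuming 128 descriptors per block (7 bits), group 300 lands in table 2, slot 44, and only groups 0, 128, 256, ... trigger the allocation of a new second-level table.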
* [PATCH][e2fsprogs] Fix inode table allocation with flexbg
From: Frédéric Bohé @ 2008-07-17 11:34 UTC
To: linux-ext4
From: Frederic Bohe <frederic.bohe@bull.net>
Disordered inode tables may appear when inode_blocks_per_group is less
than or equal to the number of groups in a flex group.
Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
---
This bug can be reproduced with:
mkfs.ext4 -t ext4dev -G512 70G
In that case, you can see with dumpe2fs that inode tables for groups 510
and 511 are placed just after group 51's inode table instead of being
placed after group 509's inode table.
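
To illustrate why the old code can misplace a table, here is a small user-space sketch of the pre-patch heuristic removed by the hunk below: it guesses the element size from the requested size instead of being told. The function name guessed_elem_size and the sample numbers (128 inode table blocks per group, 2 remaining groups) are assumptions chosen for illustration, not values measured on the 70G filesystem above.

```c
#include <stddef.h>

/*
 * Pre-patch heuristic, as removed by the patch: a request larger than
 * the flex group size is assumed to be for inode tables, anything else
 * is assumed to be for one-block bitmaps.
 */
static int guessed_elem_size(int size, int flexbg_size,
                             int inode_blocks_per_group)
{
    if (size > flexbg_size)
        return inode_blocks_per_group;
    return 1;
}

/* Size requested for the inode tables of the remaining groups. */
static int inode_table_request(int inode_blocks_per_group, int rem_grps)
{
    return inode_blocks_per_group * rem_grps;
}
```

Near the end of a 512-group flex group, a request for 2 remaining groups of 128 blocks each is 256 blocks, which is not larger than 512, so the heuristic returns 1 instead of 128 and the fast path probes start_blk + 1 rather than the block just past the previous inode table. Passing elem_size explicitly, as the patch does, removes the guess.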
alloc_tables.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
Index: e2fsprogs/lib/ext2fs/alloc_tables.c
===================================================================
--- e2fsprogs.orig/lib/ext2fs/alloc_tables.c 2008-07-17 10:33:56.000000000 +0200
+++ e2fsprogs/lib/ext2fs/alloc_tables.c 2008-07-17 10:46:49.000000000 +0200
@@ -34,9 +34,10 @@
* tables can be allocated continously and in order.
*/
static blk_t flexbg_offset(ext2_filsys fs, dgrp_t group, blk_t start_blk,
- ext2fs_block_bitmap bmap, int offset, int size)
+ ext2fs_block_bitmap bmap, int offset, int size,
+ int elem_size)
{
- int flexbg, flexbg_size, elem_size;
+ int flexbg, flexbg_size;
blk_t last_blk, first_free = 0;
dgrp_t last_grp;
@@ -54,10 +55,6 @@ static blk_t flexbg_offset(ext2_filsys f
* search is still valid.
*/
if (start_blk && group % flexbg_size) {
- if (size > flexbg_size)
- elem_size = fs->inode_blocks_per_group;
- else
- elem_size = 1;
if (ext2fs_test_block_bitmap_range(bmap, start_blk + elem_size,
size))
return start_blk + elem_size;
@@ -126,7 +123,7 @@ errcode_t ext2fs_allocate_group_table(ex
if (group && fs->group_desc[group-1].bg_block_bitmap)
prev_block = fs->group_desc[group-1].bg_block_bitmap;
start_blk = flexbg_offset(fs, group, prev_block, bmap,
- 0, rem_grps);
+ 0, rem_grps, 1);
last_blk = ext2fs_group_last_block(fs, last_grp);
}
@@ -154,7 +151,7 @@ errcode_t ext2fs_allocate_group_table(ex
if (group && fs->group_desc[group-1].bg_inode_bitmap)
prev_block = fs->group_desc[group-1].bg_inode_bitmap;
start_blk = flexbg_offset(fs, group, prev_block, bmap,
- flexbg_size, rem_grps);
+ flexbg_size, rem_grps, 1);
last_blk = ext2fs_group_last_block(fs, last_grp);
}
@@ -187,7 +184,8 @@ errcode_t ext2fs_allocate_group_table(ex
group_blk = flexbg_offset(fs, group, prev_block, bmap,
flexbg_size * 2,
fs->inode_blocks_per_group *
- rem_grps);
+ rem_grps,
+ fs->inode_blocks_per_group);
last_blk = ext2fs_group_last_block(fs, last_grp);
}
--
* Re: [PATCH][e2fsprogs] Fix inode table allocation with flexbg
From: Jose R. Santos @ 2008-07-17 13:47 UTC
To: Frédéric Bohé; +Cc: linux-ext4
On Thu, 17 Jul 2008 13:34:41 +0200
Frédéric Bohé <frederic.bohe@bull.net> wrote:
> From: Frederic Bohe <frederic.bohe@bull.net>
>
> Disordered inode tables may appear when inode_blocks_per_group is lesser
> or equal to the number of groups in a flex group.
>
Acked-by: Jose R. Santos <jrs@us.ibm.com> with some comments below
> Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
> ---
> This bug can be reproduced with:
> mkfs.ext4 -t ext4dev -G512 70G
>
> In that case, you can see with dump2fs that inode tables for groups 510
> and 511 are placed just after group 51's inode table instead of being
> placed after group 509's inode table.
>
> alloc_tables.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> Index: e2fsprogs/lib/ext2fs/alloc_tables.c
> ===================================================================
> --- e2fsprogs.orig/lib/ext2fs/alloc_tables.c 2008-07-17 10:33:56.000000000 +0200
> +++ e2fsprogs/lib/ext2fs/alloc_tables.c 2008-07-17 10:46:49.000000000 +0200
> @@ -34,9 +34,10 @@
> * tables can be allocated continously and in order.
> */
> static blk_t flexbg_offset(ext2_filsys fs, dgrp_t group, blk_t start_blk,
> - ext2fs_block_bitmap bmap, int offset, int size)
> + ext2fs_block_bitmap bmap, int offset, int size,
> + int elem_size)
While there is nothing technically wrong with the patch, I think it
exposes an issue with the original flexbg_offset function. The routine
currently requires too many arguments, and some of them need
calculations at the call site, which makes the code ugly and error
prone if someone decides to modify it later on.
I've been thinking about rewriting this function so it looks something
like this:
static blk64_t flexbg_offset(ext2_filsys fs, ext2fs_block_bitmap bmap,
dgrp_t group, unsigned int type)
and do all the offset, start_blk, size, and elem_size calculations inside
the routine. I need to make a long-term fix for this issue since I
think 7 args for this routine is just too many.
-JRS
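
One possible shape for the type-driven variant suggested above, as a rough user-space sketch only. The enum, the struct, and the function name flexbg_params_for are hypothetical; the offset and elem_size values are taken from the three call sites in the patch (0 / flexbg_size / flexbg_size * 2, and 1 / 1 / inode_blocks_per_group). The start_blk lookup is left out because it depends on the group descriptors.

```c
#include <stddef.h>

/* Hypothetical table types; not an existing e2fsprogs enum. */
enum table_type { BLOCK_BITMAP, INODE_BITMAP, INODE_TABLE };

/* Values that the call sites currently compute by hand. */
struct flexbg_params {
    int offset;     /* offset of this table kind inside the flex group */
    int size;       /* blocks needed for the remaining groups          */
    int elem_size;  /* blocks used by one group's table                */
};

static struct flexbg_params flexbg_params_for(enum table_type type,
                                              int flexbg_size,
                                              int rem_grps,
                                              int inode_blocks_per_group)
{
    struct flexbg_params p;

    switch (type) {
    case BLOCK_BITMAP:
        p.offset = 0;
        p.elem_size = 1;
        break;
    case INODE_BITMAP:
        p.offset = flexbg_size;
        p.elem_size = 1;
        break;
    case INODE_TABLE:
    default:
        p.offset = flexbg_size * 2;
        p.elem_size = inode_blocks_per_group;
        break;
    }
    p.size = p.elem_size * rem_grps;
    return p;
}
```

Deriving all three values from the type in one place keeps callers from having to pass matching offset, size, and elem_size arguments by hand.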