* [PATCH 1/5] ext4: Error out if verifying the block bitmap fails
2013-07-19 23:55 [PATCH v1 0/5] ext4: Shut down block groups when damage is detected Darrick J. Wong
@ 2013-07-19 23:55 ` Darrick J. Wong
2013-08-28 19:36 ` Theodore Ts'o
2013-07-19 23:55 ` [PATCH 2/5] ext4: Fix type declaration of ext4_validate_block_bitmap Darrick J. Wong
` (4 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2013-07-19 23:55 UTC (permalink / raw)
To: Theodore Ts'o, Darrick J. Wong; +Cc: linux-ext4
The block bitmap verification code assumes that calling ext4_error() either
panics the system or makes the fs readonly. However, this is not always true:
when 'errors=continue' is specified, an error is printed but we don't return
any indication of error to the caller, which is (probably) the block allocator,
which pretends that the crud we read in off the disk is a usable bitmap. Yuck.
A block bitmap that fails the check should at least return no bitmap to the
caller. The block allocator should be told to go look in a different group,
but that's a separate issue.
The easiest way to reproduce this is to modify bg_block_bitmap (on a ^flex_bg
fs) to point to a block outside the block group; or you can create a
metadata_csum filesystem and zero out the block bitmaps.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
fs/ext4/balloc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index ddd715e..b430afe 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -445,7 +445,10 @@ ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group)
return bh;
verify:
ext4_validate_block_bitmap(sb, desc, block_group, bh);
- return bh;
+ if (buffer_verified(bh))
+ return bh;
+ put_bh(bh);
+ return NULL;
}
/* Returns 0 on success, 1 on error */
@@ -469,7 +472,8 @@ int ext4_wait_block_bitmap(struct super_block *sb, ext4_group_t block_group,
clear_buffer_new(bh);
/* Panic or remount fs read-only if block bitmap is invalid */
ext4_validate_block_bitmap(sb, desc, block_group, bh);
- return 0;
+ /* ...but check for error just in case errors=continue. */
+ return !buffer_verified(bh);
}
struct buffer_head *
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 1/5] ext4: Error out if verifying the block bitmap fails
2013-07-19 23:55 ` [PATCH 1/5] ext4: Error out if verifying the block bitmap fails Darrick J. Wong
@ 2013-08-28 19:36 ` Theodore Ts'o
0 siblings, 0 replies; 22+ messages in thread
From: Theodore Ts'o @ 2013-08-28 19:36 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-ext4
On Fri, Jul 19, 2013 at 04:55:39PM -0700, Darrick J. Wong wrote:
> The block bitmap verification code assumes that calling ext4_error() either
> panics the system or makes the fs readonly. However, this is not always true:
> when 'errors=continue' is specified, an error is printed but we don't return
> any indication of error to the caller, which is (probably) the block allocator,
> which pretends that the crud we read in off the disk is a usable bitmap. Yuck.
>
> A block bitmap that fails the check should at least return no bitmap to the
> caller. The block allocator should be told to go look in a different group,
> but that's a separate issue.
>
> The easiest way to reproduce this is to modify bg_block_bitmap (on a ^flex_bg
> fs) to point to a block outside the block group; or you can create a
> metadata_csum filesystem and zero out the block bitmaps.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Thanks, applied.
- Ted
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 2/5] ext4: Fix type declaration of ext4_validate_block_bitmap
2013-07-19 23:55 [PATCH v1 0/5] ext4: Shut down block groups when damage is detected Darrick J. Wong
2013-07-19 23:55 ` [PATCH 1/5] ext4: Error out if verifying the block bitmap fails Darrick J. Wong
@ 2013-07-19 23:55 ` Darrick J. Wong
2013-07-24 7:12 ` Zheng Liu
2013-07-19 23:55 ` [PATCH 3/5] ext4: Mark block group as corrupt on block bitmap error Darrick J. Wong
` (3 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2013-07-19 23:55 UTC (permalink / raw)
To: Theodore Ts'o, Darrick J. Wong; +Cc: linux-ext4
The block_group parameter to ext4_validate_block_bitmap is both used as a
ext4_group_t inside the function and the same type is passed in by all callers.
We might as well use the typedef consistently instead of open-coding the
'unsigned int'.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
fs/ext4/balloc.c | 2 +-
fs/ext4/ext4.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index b430afe..735e701 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -352,7 +352,7 @@ static ext4_fsblk_t ext4_valid_block_bitmap(struct super_block *sb,
void ext4_validate_block_bitmap(struct super_block *sb,
struct ext4_group_desc *desc,
- unsigned int block_group,
+ ext4_group_t block_group,
struct buffer_head *bh)
{
ext4_fsblk_t blk;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..39d24e2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1915,7 +1915,7 @@ extern ext4_group_t ext4_get_group_number(struct super_block *sb,
extern void ext4_validate_block_bitmap(struct super_block *sb,
struct ext4_group_desc *desc,
- unsigned int block_group,
+ ext4_group_t block_group,
struct buffer_head *bh);
extern unsigned int ext4_block_group(struct super_block *sb,
ext4_fsblk_t blocknr);
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 2/5] ext4: Fix type declaration of ext4_validate_block_bitmap
2013-07-19 23:55 ` [PATCH 2/5] ext4: Fix type declaration of ext4_validate_block_bitmap Darrick J. Wong
@ 2013-07-24 7:12 ` Zheng Liu
2013-07-26 16:06 ` Darrick J. Wong
0 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-07-24 7:12 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Theodore Ts'o, linux-ext4
On Fri, Jul 19, 2013 at 04:55:46PM -0700, Darrick J. Wong wrote:
> The block_group parameter to ext4_validate_block_bitmap is both used as a
> ext4_group_t inside the function and the same type is passed in by all callers.
> We might as well use the typedef consistently instead of open-coding the
> 'unsigned int'.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> fs/ext4/balloc.c | 2 +-
> fs/ext4/ext4.h | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index b430afe..735e701 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -352,7 +352,7 @@ static ext4_fsblk_t ext4_valid_block_bitmap(struct super_block *sb,
^^^
It seems that this function also needs to be fixed.
- Zheng
>
> void ext4_validate_block_bitmap(struct super_block *sb,
> struct ext4_group_desc *desc,
> - unsigned int block_group,
> + ext4_group_t block_group,
> struct buffer_head *bh)
> {
> ext4_fsblk_t blk;
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index b577e45..39d24e2 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1915,7 +1915,7 @@ extern ext4_group_t ext4_get_group_number(struct super_block *sb,
>
> extern void ext4_validate_block_bitmap(struct super_block *sb,
> struct ext4_group_desc *desc,
> - unsigned int block_group,
> + ext4_group_t block_group,
> struct buffer_head *bh);
> extern unsigned int ext4_block_group(struct super_block *sb,
> ext4_fsblk_t blocknr);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/5] ext4: Fix type declaration of ext4_validate_block_bitmap
2013-07-24 7:12 ` Zheng Liu
@ 2013-07-26 16:06 ` Darrick J. Wong
2013-08-28 20:01 ` Theodore Ts'o
0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2013-07-26 16:06 UTC (permalink / raw)
To: Theodore Ts'o, linux-ext4
On Wed, Jul 24, 2013 at 03:12:01PM +0800, Zheng Liu wrote:
> On Fri, Jul 19, 2013 at 04:55:46PM -0700, Darrick J. Wong wrote:
> > The block_group parameter to ext4_validate_block_bitmap is both used as a
> > ext4_group_t inside the function and the same type is passed in by all callers.
> > We might as well use the typedef consistently instead of open-coding the
> > 'unsigned int'.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> > fs/ext4/balloc.c | 2 +-
> > fs/ext4/ext4.h | 2 +-
> > 2 files changed, 2 insertions(+), 2 deletions(-)
> >
> >
> > diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> > index b430afe..735e701 100644
> > --- a/fs/ext4/balloc.c
> > +++ b/fs/ext4/balloc.c
> > @@ -352,7 +352,7 @@ static ext4_fsblk_t ext4_valid_block_bitmap(struct super_block *sb,
> ^^^
> It seems that this function also needs to be fixed.
Yep. I'll get that in the next version, thank you.
--D
>
> - Zheng
>
> >
> > void ext4_validate_block_bitmap(struct super_block *sb,
> > struct ext4_group_desc *desc,
> > - unsigned int block_group,
> > + ext4_group_t block_group,
> > struct buffer_head *bh)
> > {
> > ext4_fsblk_t blk;
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index b577e45..39d24e2 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1915,7 +1915,7 @@ extern ext4_group_t ext4_get_group_number(struct super_block *sb,
> >
> > extern void ext4_validate_block_bitmap(struct super_block *sb,
> > struct ext4_group_desc *desc,
> > - unsigned int block_group,
> > + ext4_group_t block_group,
> > struct buffer_head *bh);
> > extern unsigned int ext4_block_group(struct super_block *sb,
> > ext4_fsblk_t blocknr);
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/5] ext4: Fix type declaration of ext4_validate_block_bitmap
2013-07-26 16:06 ` Darrick J. Wong
@ 2013-08-28 20:01 ` Theodore Ts'o
0 siblings, 0 replies; 22+ messages in thread
From: Theodore Ts'o @ 2013-08-28 20:01 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-ext4
On Fri, Jul 26, 2013 at 09:06:13AM -0700, Darrick J. Wong wrote:
> > > @@ -352,7 +352,7 @@ static ext4_fsblk_t ext4_valid_block_bitmap(struct super_block *sb,
> > ^^^
> > It seems that this function also needs to be fixed.
>
> Yep. I'll get that in the next version, thank you.
I've applied this patch with addition of a fix for s/unsigned int/ext4_group_t/
in the ext4_valid_block_bitmap() function as well.
Thanks,
- Ted
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 3/5] ext4: Mark block group as corrupt on block bitmap error
2013-07-19 23:55 [PATCH v1 0/5] ext4: Shut down block groups when damage is detected Darrick J. Wong
2013-07-19 23:55 ` [PATCH 1/5] ext4: Error out if verifying the block bitmap fails Darrick J. Wong
2013-07-19 23:55 ` [PATCH 2/5] ext4: Fix type declaration of ext4_validate_block_bitmap Darrick J. Wong
@ 2013-07-19 23:55 ` Darrick J. Wong
2013-07-23 3:38 ` Darrick J. Wong
2013-07-19 23:55 ` [PATCH 4/5] ext4: Mark block group as corrupt on inode " Darrick J. Wong
` (2 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2013-07-19 23:55 UTC (permalink / raw)
To: Theodore Ts'o, Darrick J. Wong; +Cc: linux-ext4
When we notice a block-bitmap corruption (because of device failure or
something else), we should mark this group as corrupt and prevent further block
allocations/deallocations from it. Currently, we end up generating one error
message for every block in the bitmap. This potentially could make the system
unstable as noticed in some bugs. With this patch, the error will be printed
only the first time and mark the entire block group as corrupted. This prevents
future access allocations/deallocations from it.
Also tested by corrupting the block
bitmap and forcefully introducing the mb_free_blocks error:
(1) create a largefile (2Gb)
$ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200
(2) umount filesystem. use dumpe2fs to see which block-bitmaps
are in use by largefile and note their block numbers
(3) use dd to zero-out the used block bitmaps
$ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct
(4) mount the FS and delete the largefile.
(5) recreate the largefile. verify that the new largefile does not
get any blocks from the groups marked as bad.
Without the patch, we will see mb_free_blocks error for each bit in
each zero'ed out bitmap at (4). With the patch, we only see the error
once per blockgroup:
[ 309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
Google-Bug-Id: 7258357
[darrick.wong@oracle.com]
Further modifications (by Darrick) to make more obvious that this corruption
bit applies to blocks only. Set the corruption flag if the block group bitmap
verification fails.
Original-author: Aditya Kali <adityakali@google.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
fs/ext4/balloc.c | 3 +++
fs/ext4/ext4.h | 3 +++
fs/ext4/mballoc.c | 28 +++++++++++++++++++++++++---
3 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 735e701..b4c406b 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -356,6 +356,7 @@ void ext4_validate_block_bitmap(struct super_block *sb,
struct buffer_head *bh)
{
ext4_fsblk_t blk;
+ struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
if (buffer_verified(bh))
return;
@@ -366,12 +367,14 @@ void ext4_validate_block_bitmap(struct super_block *sb,
ext4_unlock_group(sb, block_group);
ext4_error(sb, "bg %u: block %llu: invalid block bitmap",
block_group, blk);
+ set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
return;
}
if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
desc, bh))) {
ext4_unlock_group(sb, block_group);
ext4_error(sb, "bg %u: bad block bitmap checksum", block_group);
+ set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
return;
}
set_buffer_verified(bh);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 39d24e2..45cc955 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2448,9 +2448,12 @@ struct ext4_group_info {
#define EXT4_GROUP_INFO_NEED_INIT_BIT 0
#define EXT4_GROUP_INFO_WAS_TRIMMED_BIT 1
+#define EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT 2
#define EXT4_MB_GRP_NEED_INIT(grp) \
(test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
+#define EXT4_MB_GRP_BBITMAP_CORRUPT(grp) \
+ (test_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &((grp)->bb_state)))
#define EXT4_MB_GRP_WAS_TRIMMED(grp) \
(test_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4bbbf13b..40ebcf6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -751,13 +751,15 @@ void ext4_mb_generate_buddy(struct super_block *sb,
if (free != grp->bb_free) {
ext4_grp_locked_error(sb, group, 0, 0,
- "%u clusters in bitmap, %u in gd",
+ "%u clusters in bitmap, %u in gd; "
+ "block bitmap corrupt.",
free, grp->bb_free);
/*
- * If we intent to continue, we consider group descritor
+ * If we intend to continue, we consider group descriptor
* corrupt and update bb_free using bitmap value
*/
grp->bb_free = free;
+ set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
}
mb_set_largest_free_order(sb, grp);
@@ -1398,6 +1400,10 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
BUG_ON(last >= (sb->s_blocksize << 3));
assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group));
+ /* Don't bother if the block group is corrupt. */
+ if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)))
+ return;
+
mb_check_buddy(e4b);
mb_free_blocks_double(inode, e4b, first, count);
@@ -1423,7 +1429,11 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
inode ? inode->i_ino : 0,
blocknr,
"freeing already freed block "
- "(bit %u)", block);
+ "(bit %u); block bitmap corrupt.",
+ block);
+ /* Mark the block group as corrupt. */
+ set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT,
+ &e4b->bd_info->bb_state);
mb_regenerate_buddy(e4b);
goto done;
}
@@ -1790,6 +1800,11 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
if (err)
return err;
+ if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) {
+ ext4_mb_unload_buddy(e4b);
+ return 0;
+ }
+
ext4_lock_group(ac->ac_sb, group);
max = mb_find_extent(e4b, ac->ac_g_ex.fe_start,
ac->ac_g_ex.fe_len, &ex);
@@ -1987,6 +2002,9 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
if (cr <= 2 && free < ac->ac_g_ex.fe_len)
return 0;
+ if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
+ return 0;
+
/* We only do this if the grp has never been initialized */
if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
int ret = ext4_mb_init_group(ac->ac_sb, group);
@@ -4673,6 +4691,10 @@ do_more:
overflow = 0;
ext4_get_group_no_and_offset(sb, block, &block_group, &bit);
+ if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(
+ ext4_get_group_info(sb, block_group))))
+ return;
+
/*
* Check to see if we are freeing blocks across a group
* boundary.
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 3/5] ext4: Mark block group as corrupt on block bitmap error
2013-07-19 23:55 ` [PATCH 3/5] ext4: Mark block group as corrupt on block bitmap error Darrick J. Wong
@ 2013-07-23 3:38 ` Darrick J. Wong
2013-08-28 22:26 ` Theodore Ts'o
0 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2013-07-23 3:38 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4
On Fri, Jul 19, 2013 at 04:55:52PM -0700, Darrick J. Wong wrote:
> When we notice a block-bitmap corruption (because of device failure or
> something else), we should mark this group as corrupt and prevent further block
> allocations/deallocations from it. Currently, we end up generating one error
> message for every block in the bitmap. This potentially could make the system
> unstable as noticed in some bugs. With this patch, the error will be printed
> only the first time and mark the entire block group as corrupted. This prevents
> future access allocations/deallocations from it.
>
> Also tested by corrupting the block
> bitmap and forcefully introducing the mb_free_blocks error:
> (1) create a largefile (2Gb)
> $ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200
> (2) umount filesystem. use dumpe2fs to see which block-bitmaps
> are in use by largefile and note their block numbers
> (3) use dd to zero-out the used block bitmaps
> $ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct
> (4) mount the FS and delete the largefile.
> (5) recreate the largefile. verify that the new largefile does not
> get any blocks from the groups marked as bad.
> Without the patch, we will see mb_free_blocks error for each bit in
> each zero'ed out bitmap at (4). With the patch, we only see the error
> once per blockgroup:
> [ 309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
> [ 309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
> [ 309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
> [ 309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
> [ 309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
> [ 309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
> [ 309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
> [ 309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
> [ 309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
> [ 309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
>
> Google-Bug-Id: 7258357
Hmm. I think we need to have ext4_count_free_clusters() act as though corrupt
block groups have "zero" free blocks so that mballoc will pass the -ENOSPC
errors back to the upper layers. Afaict, if one doesn't do this, ext4
encounters the situation where marking the blocks in use fails, yet the fs
thinks there are free blocks still and ... leaves the pages dirty forever,
instead of simply failing.
Just trying this really quickly, if I blast /all/ the block groups, I see
unstoppable errors in dmesg.
The other thing I noticed is that if one turns delalloc mode on, performs a
live corruption of the bg descriptors, and then dd's a big file to the fs,
there's no error reported back to userspace either in write(), sync(), or even
umount(). Meanwhile, dmesg is getting hit with tons of corrupted-bitmap
errors.
More for me to ponder....
--D
> [darrick.wong@oracle.com]
> Further modifications (by Darrick) to make more obvious that this corruption
> bit applies to blocks only. Set the corruption flag if the block group bitmap
> verification fails.
>
> Original-author: Aditya Kali <adityakali@google.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> fs/ext4/balloc.c | 3 +++
> fs/ext4/ext4.h | 3 +++
> fs/ext4/mballoc.c | 28 +++++++++++++++++++++++++---
> 3 files changed, 31 insertions(+), 3 deletions(-)
>
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 735e701..b4c406b 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -356,6 +356,7 @@ void ext4_validate_block_bitmap(struct super_block *sb,
> struct buffer_head *bh)
> {
> ext4_fsblk_t blk;
> + struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
>
> if (buffer_verified(bh))
> return;
> @@ -366,12 +367,14 @@ void ext4_validate_block_bitmap(struct super_block *sb,
> ext4_unlock_group(sb, block_group);
> ext4_error(sb, "bg %u: block %llu: invalid block bitmap",
> block_group, blk);
> + set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> return;
> }
> if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
> desc, bh))) {
> ext4_unlock_group(sb, block_group);
> ext4_error(sb, "bg %u: bad block bitmap checksum", block_group);
> + set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> return;
> }
> set_buffer_verified(bh);
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 39d24e2..45cc955 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2448,9 +2448,12 @@ struct ext4_group_info {
>
> #define EXT4_GROUP_INFO_NEED_INIT_BIT 0
> #define EXT4_GROUP_INFO_WAS_TRIMMED_BIT 1
> +#define EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT 2
>
> #define EXT4_MB_GRP_NEED_INIT(grp) \
> (test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
> +#define EXT4_MB_GRP_BBITMAP_CORRUPT(grp) \
> + (test_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &((grp)->bb_state)))
>
> #define EXT4_MB_GRP_WAS_TRIMMED(grp) \
> (test_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 4bbbf13b..40ebcf6 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -751,13 +751,15 @@ void ext4_mb_generate_buddy(struct super_block *sb,
>
> if (free != grp->bb_free) {
> ext4_grp_locked_error(sb, group, 0, 0,
> - "%u clusters in bitmap, %u in gd",
> + "%u clusters in bitmap, %u in gd; "
> + "block bitmap corrupt.",
> free, grp->bb_free);
> /*
> - * If we intent to continue, we consider group descritor
> + * If we intend to continue, we consider group descriptor
> * corrupt and update bb_free using bitmap value
> */
> grp->bb_free = free;
> + set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> }
> mb_set_largest_free_order(sb, grp);
>
> @@ -1398,6 +1400,10 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
>
> BUG_ON(last >= (sb->s_blocksize << 3));
> assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group));
> + /* Don't bother if the block group is corrupt. */
> + if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)))
> + return;
> +
> mb_check_buddy(e4b);
> mb_free_blocks_double(inode, e4b, first, count);
>
> @@ -1423,7 +1429,11 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
> inode ? inode->i_ino : 0,
> blocknr,
> "freeing already freed block "
> - "(bit %u)", block);
> + "(bit %u); block bitmap corrupt.",
> + block);
> + /* Mark the block group as corrupt. */
> + set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT,
> + &e4b->bd_info->bb_state);
> mb_regenerate_buddy(e4b);
> goto done;
> }
> @@ -1790,6 +1800,11 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
> if (err)
> return err;
>
> + if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) {
> + ext4_mb_unload_buddy(e4b);
> + return 0;
> + }
> +
> ext4_lock_group(ac->ac_sb, group);
> max = mb_find_extent(e4b, ac->ac_g_ex.fe_start,
> ac->ac_g_ex.fe_len, &ex);
> @@ -1987,6 +2002,9 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
> if (cr <= 2 && free < ac->ac_g_ex.fe_len)
> return 0;
>
> + if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
> + return 0;
> +
> /* We only do this if the grp has never been initialized */
> if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
> int ret = ext4_mb_init_group(ac->ac_sb, group);
> @@ -4673,6 +4691,10 @@ do_more:
> overflow = 0;
> ext4_get_group_no_and_offset(sb, block, &block_group, &bit);
>
> + if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(
> + ext4_get_group_info(sb, block_group))))
> + return;
> +
> /*
> * Check to see if we are freeing blocks across a group
> * boundary.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 3/5] ext4: Mark block group as corrupt on block bitmap error
2013-07-23 3:38 ` Darrick J. Wong
@ 2013-08-28 22:26 ` Theodore Ts'o
0 siblings, 0 replies; 22+ messages in thread
From: Theodore Ts'o @ 2013-08-28 22:26 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-ext4
On Mon, Jul 22, 2013 at 08:38:33PM -0700, Darrick J. Wong wrote:
> On Fri, Jul 19, 2013 at 04:55:52PM -0700, Darrick J. Wong wrote:
> > When we notice a block-bitmap corruption (because of device failure or
> > something else), we should mark this group as corrupt and prevent further block
> > allocations/deallocations from it. Currently, we end up generating one error
> > message for every block in the bitmap. This potentially could make the system
> > unstable as noticed in some bugs. With this patch, the error will be printed
> > only the first time and mark the entire block group as corrupted. This prevents
> > future access allocations/deallocations from it.
Thanks, applied....
> Hmm. I think we need to have ext4_count_free_clusters() act as though corrupt
> block groups have "zero" free blocks so that mballoc will pass the -ENOSPC
> errors back to the upper layers. Afaict, if one doesn't do this, ext4
> encounters the situation where marking the blocks in use fails, yet the fs
> thinks there are free blocks still and ... leaves the pages dirty forever,
> instead of simply failing.
Yes, that's something we should probably add to make things to be more
robust in the case where we have huge numbers of corrupted (or
hardware failures) in the block bitmaps.
> Just trying this really quickly, if I blast /all/ the block groups, I see
> unstoppable errors in dmesg.
What sort of errors did you end up seeing?
> The other thing I noticed is that if one turns delalloc mode on, performs a
> live corruption of the bg descriptors, and then dd's a big file to the fs,
> there's no error reported back to userspace either in write(), sync(), or even
> umount(). Meanwhile, dmesg is getting hit with tons of corrupted-bitmap
> errors.
I'm not sure there's much we can do about this. On the other hand,
how realistic of a threat is this. If it's happening randomly, how
likely is this to happen? And if it's a deliberate corruption, the
attacker can probably do a lot worse.
In practice, these weren't think we were really worried about when we
primarily worried about hardware failures, since hardware failures are
random, and so if the errors are affecting a large number of block
bitmaps, the storage device is probably completely toasted and there's
nothing we can do about it anyway.
When metadata checksums are enabled, this gets trickier, since it's
possible for a large number of metadata checksums to be corrupted in
the bg descriptors, especially if the bg descriptors get written to
while the file system is mounted. This will smash a huge number of
checksums, and then badness will happen. But realistically, bad
things would happen if that happened while the file system is mounted
even without checksums being enabled. It maybe that the best thing we
can do is to some kind of rate limiting with log messages, or some
kind of hueristic where if a sufficient number of different checksums
are found to be broken, we take much more drastic, such as
unconditionally shutting down the file system.
The main issue here is that errors=continue is used if we want to do
some amount of recovery after certain types of file system corruption,
but what we really need is a mode where we can continue after certain
types of fs errors (especially if userspace is doing things its own
data block checksums and has its own recovery mechanisms at the
cluster file system level). But if things gets really, really, bad,
we shouldn't trying to bull ahead in the face of errors when it's
clear it's going to be counterproductive.
> More for me to ponder....
Indeed...
- Ted
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 4/5] ext4: Mark block group as corrupt on inode bitmap error
2013-07-19 23:55 [PATCH v1 0/5] ext4: Shut down block groups when damage is detected Darrick J. Wong
` (2 preceding siblings ...)
2013-07-19 23:55 ` [PATCH 3/5] ext4: Mark block group as corrupt on block bitmap error Darrick J. Wong
@ 2013-07-19 23:55 ` Darrick J. Wong
2013-07-24 7:22 ` Zheng Liu
2013-07-19 23:56 ` [PATCH 5/5] ext4: Mark group corrupt on group descriptor checksum error Darrick J. Wong
2013-07-21 14:32 ` [PATCH v1 0/5] ext4: Shut down block groups when damage is detected Zheng Liu
5 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2013-07-19 23:55 UTC (permalink / raw)
To: Theodore Ts'o, Darrick J. Wong; +Cc: linux-ext4
If we detect either a discrepancy between the inode bitmap and the inode counts
or the inode bitmap fails to pass validation checks, mark the block group
corrupt and refuse to allocate or deallocate inodes from the group.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
fs/ext4/ext4.h | 3 +++
fs/ext4/ialloc.c | 29 +++++++++++++++++++++++++----
2 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 45cc955..b8ac53d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2449,11 +2449,14 @@ struct ext4_group_info {
#define EXT4_GROUP_INFO_NEED_INIT_BIT 0
#define EXT4_GROUP_INFO_WAS_TRIMMED_BIT 1
#define EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT 2
+#define EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT 4
#define EXT4_MB_GRP_NEED_INIT(grp) \
(test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
#define EXT4_MB_GRP_BBITMAP_CORRUPT(grp) \
(test_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &((grp)->bb_state)))
+#define EXT4_MB_GRP_IBITMAP_CORRUPT(grp) \
+ (test_bit(EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT, &((grp)->bb_state)))
#define EXT4_MB_GRP_WAS_TRIMMED(grp) \
(test_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index f03598c..08c7fa7 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -117,6 +117,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
struct ext4_group_desc *desc;
struct buffer_head *bh = NULL;
ext4_fsblk_t bitmap_blk;
+ struct ext4_group_info *grp;
desc = ext4_get_group_desc(sb, block_group, NULL);
if (!desc)
@@ -185,6 +186,8 @@ verify:
put_bh(bh);
ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
"inode_bitmap = %llu", block_group, bitmap_blk);
+ grp = ext4_get_group_info(sb, block_group);
+ set_bit(EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT, &grp->bb_state);
return NULL;
}
ext4_unlock_group(sb, block_group);
@@ -221,6 +224,7 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
struct ext4_super_block *es;
struct ext4_sb_info *sbi;
int fatal = 0, err, count, cleared;
+ struct ext4_group_info *grp;
if (!sb) {
printk(KERN_ERR "EXT4-fs: %s:%d: inode on "
@@ -266,7 +270,9 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
bitmap_bh = ext4_read_inode_bitmap(sb, block_group);
- if (!bitmap_bh)
+ /* Don't bother if the inode bitmap is corrupt. */
+ grp = ext4_get_group_info(sb, block_group);
+ if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) || !bitmap_bh)
goto error_return;
BUFFER_TRACE(bitmap_bh, "get_write_access");
@@ -315,8 +321,10 @@ out:
err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
if (!fatal)
fatal = err;
- } else
+ } else {
ext4_error(sb, "bit already cleared for inode %lu", ino);
+ set_bit(EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT, &grp->bb_state);
+ }
error_return:
brelse(bitmap_bh);
@@ -652,6 +660,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
struct inode *ret;
ext4_group_t i;
ext4_group_t flex_group;
+ struct ext4_group_info *grp;
/* Cannot create files in a deleted directory */
if (!dir || !dir->i_nlink)
@@ -725,10 +734,22 @@ got_group:
continue;
}
+ grp = ext4_get_group_info(sb, group);
+ /* Skip groups with already-known suspicious inode tables */
+ if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) {
+ if (++group == ngroups)
+ group = 0;
+ continue;
+ }
+
brelse(inode_bitmap_bh);
inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
- if (!inode_bitmap_bh)
- goto out;
+ /* Skip groups with suspicious inode tables */
+ if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp) || !inode_bitmap_bh) {
+ if (++group == ngroups)
+ group = 0;
+ continue;
+ }
repeat_in_this_group:
ino = ext4_find_next_zero_bit((unsigned long *)
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH 4/5] ext4: Mark block group as corrupt on inode bitmap error
2013-07-19 23:55 ` [PATCH 4/5] ext4: Mark block group as corrupt on inode " Darrick J. Wong
@ 2013-07-24 7:22 ` Zheng Liu
2013-08-28 22:45 ` Theodore Ts'o
0 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-07-24 7:22 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Theodore Ts'o, linux-ext4
On Fri, Jul 19, 2013 at 04:55:59PM -0700, Darrick J. Wong wrote:
> If we detect either a discrepancy between the inode bitmap and the inode counts
> or the inode bitmap fails to pass validation checks, mark the block group
> corrupt and refuse to allocate or deallocate inodes from the group.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> fs/ext4/ext4.h | 3 +++
> fs/ext4/ialloc.c | 29 +++++++++++++++++++++++++----
> 2 files changed, 28 insertions(+), 4 deletions(-)
>
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 45cc955..b8ac53d 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2449,11 +2449,14 @@ struct ext4_group_info {
> #define EXT4_GROUP_INFO_NEED_INIT_BIT 0
> #define EXT4_GROUP_INFO_WAS_TRIMMED_BIT 1
> #define EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT 2
> +#define EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT 4
>
> #define EXT4_MB_GRP_NEED_INIT(grp) \
> (test_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &((grp)->bb_state)))
> #define EXT4_MB_GRP_BBITMAP_CORRUPT(grp) \
> (test_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &((grp)->bb_state)))
> +#define EXT4_MB_GRP_IBITMAP_CORRUPT(grp) \
> + (test_bit(EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT, &((grp)->bb_state)))
>
> #define EXT4_MB_GRP_WAS_TRIMMED(grp) \
> (test_bit(EXT4_GROUP_INFO_WAS_TRIMMED_BIT, &((grp)->bb_state)))
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index f03598c..08c7fa7 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -117,6 +117,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> struct ext4_group_desc *desc;
> struct buffer_head *bh = NULL;
> ext4_fsblk_t bitmap_blk;
> + struct ext4_group_info *grp;
>
> desc = ext4_get_group_desc(sb, block_group, NULL);
> if (!desc)
> @@ -185,6 +186,8 @@ verify:
> put_bh(bh);
> ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
> "inode_bitmap = %llu", block_group, bitmap_blk);
> + grp = ext4_get_group_info(sb, block_group);
> + set_bit(EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT, &grp->bb_state);
> return NULL;
> }
> ext4_unlock_group(sb, block_group);
> @@ -221,6 +224,7 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
> struct ext4_super_block *es;
> struct ext4_sb_info *sbi;
> int fatal = 0, err, count, cleared;
> + struct ext4_group_info *grp;
>
> if (!sb) {
> printk(KERN_ERR "EXT4-fs: %s:%d: inode on "
> @@ -266,7 +270,9 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
> block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
> bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
> bitmap_bh = ext4_read_inode_bitmap(sb, block_group);
> - if (!bitmap_bh)
> + /* Don't bother if the inode bitmap is corrupt. */
> + grp = ext4_get_group_info(sb, block_group);
> + if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) || !bitmap_bh)
> goto error_return;
It seems that this is a duplicated check. If we encounters a currupted
inode bitmap, ext4_read_inode_bitmap() will return null.
>
> BUFFER_TRACE(bitmap_bh, "get_write_access");
> @@ -315,8 +321,10 @@ out:
> err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
> if (!fatal)
> fatal = err;
> - } else
> + } else {
> ext4_error(sb, "bit already cleared for inode %lu", ino);
> + set_bit(EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT, &grp->bb_state);
> + }
>
> error_return:
> brelse(bitmap_bh);
> @@ -652,6 +660,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
> struct inode *ret;
> ext4_group_t i;
> ext4_group_t flex_group;
> + struct ext4_group_info *grp;
>
> /* Cannot create files in a deleted directory */
> if (!dir || !dir->i_nlink)
> @@ -725,10 +734,22 @@ got_group:
> continue;
> }
>
> + grp = ext4_get_group_info(sb, group);
> + /* Skip groups with already-known suspicious inode tables */
> + if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) {
> + if (++group == ngroups)
> + group = 0;
> + continue;
> + }
> +
> brelse(inode_bitmap_bh);
> inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
> - if (!inode_bitmap_bh)
> - goto out;
> + /* Skip groups with suspicious inode tables */
> + if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp) || !inode_bitmap_bh) {
> + if (++group == ngroups)
> + group = 0;
> + continue;
> + }
The same as above.
- Zheng
>
> repeat_in_this_group:
> ino = ext4_find_next_zero_bit((unsigned long *)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 4/5] ext4: Mark block group as corrupt on inode bitmap error
2013-07-24 7:22 ` Zheng Liu
@ 2013-08-28 22:45 ` Theodore Ts'o
0 siblings, 0 replies; 22+ messages in thread
From: Theodore Ts'o @ 2013-08-28 22:45 UTC (permalink / raw)
To: Darrick J. Wong, linux-ext4
On Wed, Jul 24, 2013 at 03:22:52PM +0800, Zheng Liu wrote:
> On Fri, Jul 19, 2013 at 04:55:59PM -0700, Darrick J. Wong wrote:
> > If we detect either a discrepancy between the inode bitmap and the inode counts
> > or the inode bitmap fails to pass validation checks, mark the block group
> > corrupt and refuse to allocate or deallocate inodes from the group.
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Thanks, applied.
> > @@ -266,7 +270,9 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
> > block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
> > bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
> > bitmap_bh = ext4_read_inode_bitmap(sb, block_group);
> > - if (!bitmap_bh)
> > + /* Don't bother if the inode bitmap is corrupt. */
> > + grp = ext4_get_group_info(sb, block_group);
> > + if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) || !bitmap_bh)
> > goto error_return;
>
> It seems that this is a duplicated check. If we encounters a currupted
> inode bitmap, ext4_read_inode_bitmap() will return null.
If an already released inode is freed, the IBITMAP_CORRUPT bit can be
set even though the checksum is valid and ext4_read_inode_bitmap()
returns a non-null bh pointer.
- Ted
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 5/5] ext4: Mark group corrupt on group descriptor checksum error
2013-07-19 23:55 [PATCH v1 0/5] ext4: Shut down block groups when damage is detected Darrick J. Wong
` (3 preceding siblings ...)
2013-07-19 23:55 ` [PATCH 4/5] ext4: Mark block group as corrupt on inode " Darrick J. Wong
@ 2013-07-19 23:56 ` Darrick J. Wong
2013-08-28 22:49 ` Theodore Ts'o
2013-07-21 14:32 ` [PATCH v1 0/5] ext4: Shut down block groups when damage is detected Zheng Liu
5 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2013-07-19 23:56 UTC (permalink / raw)
To: Theodore Ts'o, Darrick J. Wong; +Cc: linux-ext4
If the group descriptor fails validation, mark the whole blockgroup corrupt so
that the inode/block allocators skip this group. The previous approach takes
the risk of writing to a damaged group descriptor; hopefully it was never the
case that the [ib]bitmap fields pointed to another valid block and got dirtied,
since the memset would fill the page with 1s.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
fs/ext4/balloc.c | 9 ++++-----
fs/ext4/ialloc.c | 10 ++++------
2 files changed, 8 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index b4c406b..f19e94d 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -184,6 +184,7 @@ void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
struct ext4_sb_info *sbi = EXT4_SB(sb);
ext4_fsblk_t start, tmp;
int flex_bg = 0;
+ struct ext4_group_info *grp;
J_ASSERT_BH(bh, buffer_locked(bh));
@@ -191,11 +192,9 @@ void ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
* essentially implementing a per-group read-only flag. */
if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) {
ext4_error(sb, "Checksum bad for group %u", block_group);
- ext4_free_group_clusters_set(sb, gdp, 0);
- ext4_free_inodes_set(sb, gdp, 0);
- ext4_itable_unused_set(sb, gdp, 0);
- memset(bh->b_data, 0xff, sb->s_blocksize);
- ext4_block_bitmap_csum_set(sb, block_group, gdp, bh);
+ grp = ext4_get_group_info(sb, block_group);
+ set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
+ set_bit(EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT, &grp->bb_state);
return;
}
memset(bh->b_data, 0, sb->s_blocksize);
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 08c7fa7..f75be27 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -70,18 +70,16 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb,
ext4_group_t block_group,
struct ext4_group_desc *gdp)
{
+ struct ext4_group_info *grp;
J_ASSERT_BH(bh, buffer_locked(bh));
/* If checksum is bad mark all blocks and inodes use to prevent
* allocation, essentially implementing a per-group read-only flag. */
if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) {
ext4_error(sb, "Checksum bad for group %u", block_group);
- ext4_free_group_clusters_set(sb, gdp, 0);
- ext4_free_inodes_set(sb, gdp, 0);
- ext4_itable_unused_set(sb, gdp, 0);
- memset(bh->b_data, 0xff, sb->s_blocksize);
- ext4_inode_bitmap_csum_set(sb, block_group, gdp, bh,
- EXT4_INODES_PER_GROUP(sb) / 8);
+ grp = ext4_get_group_info(sb, block_group);
+ set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
+ set_bit(EXT4_GROUP_INFO_IBITMAP_CORRUPT_BIT, &grp->bb_state);
return 0;
}
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v1 0/5] ext4: Shut down block groups when damage is detected
2013-07-19 23:55 [PATCH v1 0/5] ext4: Shut down block groups when damage is detected Darrick J. Wong
` (4 preceding siblings ...)
2013-07-19 23:56 ` [PATCH 5/5] ext4: Mark group corrupt on group descriptor checksum error Darrick J. Wong
@ 2013-07-21 14:32 ` Zheng Liu
2013-07-29 15:28 ` Jeff Moyer
5 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-07-21 14:32 UTC (permalink / raw)
To: Darrick J. Wong, Theodore Ts'o; +Cc: linux-ext4
On Fri, Jul 19, 2013 at 04:55:32PM -0700, Darrick J. Wong wrote:
> Right now, ext4 doesn't do quite a good enough job shutting off allocation and
> freeing activity in block groups when damage is detected, which means that ext4
> can obliviously load a corrupt bitmap, base allocation decisions off of that,
> and trash the filesystem. We'd like to be able to freeze the block group when
> this happens, so hopefully the next fsck can repair the damage.
>
> The first patch fixes the behavior that a corrupt bitmap can be returned to
> mballoc as if it was accurate. The second patch is a trivial fix, and the two
> after it provide for detecting damage in either the block bitmap or the inode
> bitmap, and disabling all allocation/deallocation activity in the block group.
> The final patch changes runtime block group descriptor validation failure
> behavior to use the corruption flag to mark off the block group.
>
> This patchset has been tested (albeit lightly) against 3.11-rc1 on x64. I'm
> wondering about a few things -- if we detect corrupt *inodes*, should we invoke
> this mechanism as well? Second, as I mentioned a few days ago, maybe it's time
> for block_validity to be set always, since it seems to have a low speed impact?
> Third, the block bitmap corruption flag patch is based off of Aditya Kali's
> patch that you forwarded; can a proper Signed-off-by be attached since I mostly
> just massaged that one into 3.11?
>
> Comments and questions are, as always, welcome.
Wow, it seems to me that I have missed a very important thread [1] after
serveral crazy busy weeks. There is an idea that is in my mind for a
while and I still can not have a proper time to try it.
My idea is to let file system can ignore the currurted block. Namely,
when we meet a currupted block, we will track it as bad block in bad
block inode and find another block to save data. This currupted block
will never be used. The first step in my mind is to detect a currpted
block and mark it as bad block. After reading the thread and Darrick's
original patch, I think Darrick's patch is a good start.
At Taobao, we have a large CDN system. These servers are a cache for
web site, and this system can tolerate the data loss. So we hope when
we detect a currupted block, we can just ignore it and use another block
until the whole disk currupted or the server is dropped.
I will take a closer look at these patches later.
Thanks,
- Zheng
1. http://www.spinics.net/lists/linux-ext4/msg39053.html
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 0/5] ext4: Shut down block groups when damage is detected
2013-07-21 14:32 ` [PATCH v1 0/5] ext4: Shut down block groups when damage is detected Zheng Liu
@ 2013-07-29 15:28 ` Jeff Moyer
2013-07-30 0:31 ` Zheng Liu
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Jeff Moyer @ 2013-07-29 15:28 UTC (permalink / raw)
To: Zheng Liu; +Cc: Darrick J. Wong, Theodore Ts'o, linux-ext4
Zheng Liu <gnehzuil.liu@gmail.com> writes:
> My idea is to let file system can ignore the currurted block. Namely,
> when we meet a currupted block, we will track it as bad block in bad
> block inode and find another block to save data. This currupted block
> will never be used. The first step in my mind is to detect a currpted
> block and mark it as bad block. After reading the thread and Darrick's
> original patch, I think Darrick's patch is a good start.
I think it's important to call out the exact failure scenario you're
trying to address. For hard disks, if you get a read error, it can
typically be recovered by re-writing the block. I imagine this is what
fsck would be doing for metadata repair. So, I'm not at all sure why
you'd want to track bad blocks in the file system itself. Could you
elaborate, please?
Cheers,
Jeff
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 0/5] ext4: Shut down block groups when damage is detected
2013-07-29 15:28 ` Jeff Moyer
@ 2013-07-30 0:31 ` Zheng Liu
2013-07-31 18:52 ` Jan Kara
2013-07-30 1:57 ` Theodore Ts'o
2013-08-10 6:02 ` Darrick J. Wong
2 siblings, 1 reply; 22+ messages in thread
From: Zheng Liu @ 2013-07-30 0:31 UTC (permalink / raw)
To: Jeff Moyer; +Cc: Darrick J. Wong, Theodore Ts'o, linux-ext4
Hi Jeff,
On Mon, Jul 29, 2013 at 11:28:38AM -0400, Jeff Moyer wrote:
> Zheng Liu <gnehzuil.liu@gmail.com> writes:
>
> > My idea is to let file system can ignore the currurted block. Namely,
> > when we meet a currupted block, we will track it as bad block in bad
> > block inode and find another block to save data. This currupted block
> > will never be used. The first step in my mind is to detect a currpted
> > block and mark it as bad block. After reading the thread and Darrick's
> > original patch, I think Darrick's patch is a good start.
>
> I think it's important to call out the exact failure scenario you're
> trying to address. For hard disks, if you get a read error, it can
> typically be recovered by re-writing the block. I imagine this is what
> fsck would be doing for metadata repair. So, I'm not at all sure why
> you'd want to track bad blocks in the file system itself. Could you
> elaborate, please?
In our product system at Taobao, we have a large CDN system around the
country. These servers cache the most of web pages, images, etc....
These servers have some disks, and the disk must break down at some
time. Now we need to umount this disk, and the whole disk just be left
in server until the whole server is dropped. But as you have pointed
out, when we meet a disk failure, the whole disk might still works. So
we hope that the file system could track the bad block, doesn't allocate
them, and the rest of spaces also can be used. This can help us to
reduce the cost.
As you said above, some faliure scenarios are hard to be addressed.
E.g., we couldn't read any data from the disk. But most scenarios are
that the disk just has some bad sectors. So that would be great if the
disk still can be used. In addition, we don't care about whether fsck
can fix these bad blocks because we don't want to reboot the server. As
I describe before, these servers are as a cache of web site. If they
are rebooted, they must take some time to preload the content from the
other servers and can not provide service. This is not better than what
we do now (umount the disk).
Certainly, this might makes no sense to SSD/Flash device because when we
get an error from these devices, it is possible that they couldn't be
used.
Regards,
- Zheng
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 0/5] ext4: Shut down block groups when damage is detected
2013-07-30 0:31 ` Zheng Liu
@ 2013-07-31 18:52 ` Jan Kara
2013-07-31 21:28 ` Theodore Ts'o
0 siblings, 1 reply; 22+ messages in thread
From: Jan Kara @ 2013-07-31 18:52 UTC (permalink / raw)
To: Zheng Liu; +Cc: Jeff Moyer, Darrick J. Wong, Theodore Ts'o, linux-ext4
On Tue 30-07-13 08:31:09, Zheng Liu wrote:
> Hi Jeff,
>
> On Mon, Jul 29, 2013 at 11:28:38AM -0400, Jeff Moyer wrote:
> > Zheng Liu <gnehzuil.liu@gmail.com> writes:
> >
> > > My idea is to let file system can ignore the currurted block. Namely,
> > > when we meet a currupted block, we will track it as bad block in bad
> > > block inode and find another block to save data. This currupted block
> > > will never be used. The first step in my mind is to detect a currpted
> > > block and mark it as bad block. After reading the thread and Darrick's
> > > original patch, I think Darrick's patch is a good start.
> >
> > I think it's important to call out the exact failure scenario you're
> > trying to address. For hard disks, if you get a read error, it can
> > typically be recovered by re-writing the block. I imagine this is what
> > fsck would be doing for metadata repair. So, I'm not at all sure why
> > you'd want to track bad blocks in the file system itself. Could you
> > elaborate, please?
>
> In our product system at Taobao, we have a large CDN system around the
> country. These servers cache the most of web pages, images, etc....
> These servers have some disks, and the disk must break down at some
> time. Now we need to umount this disk, and the whole disk just be left
> in server until the whole server is dropped. But as you have pointed
> out, when we meet a disk failure, the whole disk might still works. So
> we hope that the file system could track the bad block, doesn't allocate
> them, and the rest of spaces also can be used. This can help us to
> reduce the cost.
Well, before spending too much time with this, try finding some study
(I've read some from Google I think, just I don't have the url at hand) on
what is the estimated lifetime of a disk after bad sectors start appearing.
What I remember is that usually when bad sectors start appearing the disk
is going to die within weeks with high probability. So I'm not sure if the
cost saving of additional few weeks of lifetime is worth the trouble. As
Ted said, there may be other reasons why you'd want a feature like this -
kernel error causing bitmap corruption - or just that you need to keep the
machine up for a few more hours before you can take it down for
maintenance.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 0/5] ext4: Shut down block groups when damage is detected
2013-07-31 18:52 ` Jan Kara
@ 2013-07-31 21:28 ` Theodore Ts'o
0 siblings, 0 replies; 22+ messages in thread
From: Theodore Ts'o @ 2013-07-31 21:28 UTC (permalink / raw)
To: Jan Kara; +Cc: Zheng Liu, Jeff Moyer, Darrick J. Wong, linux-ext4
On Wed, Jul 31, 2013 at 08:52:43PM +0200, Jan Kara wrote:
> Well, before spending too much time with this, try finding some study
> (I've read some from Google I think, just I don't have the url at hand) on
> what is the estimated lifetime of a disk after bad sectors start appearing.
Also worth considering is that error rate profile for flash devices
might be quite different from HDD's.
- Ted
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 0/5] ext4: Shut down block groups when damage is detected
2013-07-29 15:28 ` Jeff Moyer
2013-07-30 0:31 ` Zheng Liu
@ 2013-07-30 1:57 ` Theodore Ts'o
2013-08-10 6:02 ` Darrick J. Wong
2 siblings, 0 replies; 22+ messages in thread
From: Theodore Ts'o @ 2013-07-30 1:57 UTC (permalink / raw)
To: Jeff Moyer; +Cc: Zheng Liu, Darrick J. Wong, linux-ext4
On Mon, Jul 29, 2013 at 11:28:38AM -0400, Jeff Moyer wrote:
> I think it's important to call out the exact failure scenario you're
> trying to address. For hard disks, if you get a read error, it can
> typically be recovered by re-writing the block. I imagine this is what
> fsck would be doing for metadata repair. So, I'm not at all sure why
> you'd want to track bad blocks in the file system itself. Could you
> elaborate, please?
The basic idea why we had a similar patch in Google was so that when
we discovered a potential problem in an allocation bitmap (i.e.,
either a read error, or finding that we had freed a bloc/inode which
was already marked as freed), instead of panic'ing the entire server,
or remounting the file system read/only (or otherwise taking it
off-line), you can just avoid allocating any blocks/inodes in that
block group (since we can't trust the allocation bitmap), but we can
keep using the file system, in a somewhat degraded mode.
Of course, eventually you'd want to take the machine off-line and run
fsck on the whole thing, and then rewrite the broken allocation
bitmap. But in the meantime, there might be circumstances where it
would be inconvenient (or violate some pesky SLA :-), to take down the
server or even the individual storage device to run fsck on it.
Regards,
- Ted
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 0/5] ext4: Shut down block groups when damage is detected
2013-07-29 15:28 ` Jeff Moyer
2013-07-30 0:31 ` Zheng Liu
2013-07-30 1:57 ` Theodore Ts'o
@ 2013-08-10 6:02 ` Darrick J. Wong
2 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2013-08-10 6:02 UTC (permalink / raw)
To: Jeff Moyer; +Cc: Zheng Liu, Theodore Ts'o, linux-ext4
On Mon, Jul 29, 2013 at 11:28:38AM -0400, Jeff Moyer wrote:
> Zheng Liu <gnehzuil.liu@gmail.com> writes:
>
> > My idea is to let file system can ignore the currurted block. Namely,
> > when we meet a currupted block, we will track it as bad block in bad
> > block inode and find another block to save data. This currupted block
> > will never be used. The first step in my mind is to detect a currpted
> > block and mark it as bad block. After reading the thread and Darrick's
> > original patch, I think Darrick's patch is a good start.
>
> I think it's important to call out the exact failure scenario you're
> trying to address. For hard disks, if you get a read error, it can
> typically be recovered by re-writing the block. I imagine this is what
> fsck would be doing for metadata repair. So, I'm not at all sure why
> you'd want to track bad blocks in the file system itself. Could you
> elaborate, please?
Sorry, was on vacation.
Actually, I had a few different scenarios in mind when I was writing this
patchset. I've observed that feeding bad block bitmaps to ext4 (misplaced
write, for example) results in it using the garbage bitmap for allocations and
smashing the fs. Metadata checksumming (and the other sanity checks) at least
help us to detect these situations, but even with that, the allocators will
repeatedly beat on broken block groups. The repetition causes log spew and in
a few cases the block allocator will go into an infinite loop looking for
space. This patch set fixes the log spew and the infinite loops.
--D
>
> Cheers,
> Jeff
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 22+ messages in thread