* ext34_free_inode's mess
From: Dmitry Monakhov @ 2010-04-14 11:19 UTC (permalink / raw)
To: ext4 development; +Cc: Jan Kara
[-- Attachment #1: Type: text/plain, Size: 1158 bytes --]
I've finally automated my favorite testcase (see attachment);
until now I ran it by hand.
And sometimes I saw the following complaint from fsck:
fsck.ext4 -f -n /dev/sdb2
...
Pass 5: Checking group summary information
Inode bitmap differences: -93582
Fix? no
Free inodes count wrong for group #12 (4634, counted=4633).
Fix? no
Free inodes count wrong (35610, counted=35609).
Fix? no
...
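For reference, a minimal sketch of how an inode number maps to a block group and a bit in that group's inode bitmap, following the (ino - 1) / inodes-per-group arithmetic used in ext{3,4}_free_inode. The names locate_inode and struct inode_pos are hypothetical, and the inodes-per-group value in the usage note is an assumption, not taken from the filesystem above.

```c
#include <stdint.h>

/* Position of an inode within the per-group inode bitmaps. */
struct inode_pos {
	uint32_t group;	/* block group index */
	uint32_t bit;	/* zero-based bit within that group's inode bitmap */
};

/*
 * Inode numbers start at 1, so inode `ino` is entry (ino - 1) overall.
 * inodes_per_group is a per-filesystem value and must be non-zero.
 */
static struct inode_pos locate_inode(uint32_t ino, uint32_t inodes_per_group)
{
	struct inode_pos pos;

	pos.group = (ino - 1) / inodes_per_group;
	pos.bit = (ino - 1) % inodes_per_group;
	return pos;
}
```

With an assumed 7680 inodes per group, locate_inode(93582, 7680) yields group 12, which would be consistent with the group number in the fsck report; the real value should be read from the superblock (for example with dumpe2fs -h).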
I've started to look at the inode bitmap manipulation code paths
and found strange logic in the ext{3,4}_free_inode functions:
1) The group lock is acquired twice, once for the bitmap and once for
group_desc. There is no advantage to this double locking; only the
error path (where the bit is already cleared) benefits from
this locking scheme.
It is reasonable to batch both updates into one locked block.
2) If we fail to read gdp then bh2 is undefined, which
may result in an oops due to an undefined pointer dereference.
3) If we fail to get write_access to gdp we skip
handle_dirty_metadata for inode_bitmap, which is also a bug.
I've redesigned the free_inode logic (see the next two emails) and
currently I'm not able to reproduce the bug, but I cannot
guarantee that it has gone away.
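To illustrate point 1, here is a user-space sketch of batching the bitmap update and the free-inode counter update into a single critical section. It is only a model under stated assumptions: a pthread mutex stands in for the kernel's per-group lock, the group holds just 64 inodes, and struct group, count_free, group_init and group_free_inode are hypothetical names, not kernel API.

```c
#include <pthread.h>
#include <stdint.h>

#define INODES_PER_GROUP 64u	/* assumption: tiny group, one 64-bit word */

struct group {
	pthread_mutex_t lock;	/* stands in for the per-group lock */
	uint64_t bitmap;	/* bit set = inode in use */
	unsigned free_inodes;	/* models bg_free_inodes_count */
};

/* Count the zero bits, i.e. the free inodes, in the bitmap. */
static unsigned count_free(uint64_t bitmap)
{
	unsigned n = 0;

	for (unsigned i = 0; i < INODES_PER_GROUP; i++)
		if (!(bitmap & ((uint64_t)1 << i)))
			n++;
	return n;
}

static void group_init(struct group *g, uint64_t bitmap)
{
	pthread_mutex_init(&g->lock, NULL);
	g->bitmap = bitmap;
	g->free_inodes = count_free(bitmap);
}

/*
 * Clear the bit and bump the counter under one lock, so that
 * free_inodes == count_free(bitmap) holds whenever the lock is free.
 * Returns 1 if the bit was set, 0 if it was already clear.
 */
static int group_free_inode(struct group *g, unsigned bit)
{
	uint64_t mask = (uint64_t)1 << bit;
	int cleared;

	pthread_mutex_lock(&g->lock);
	cleared = (g->bitmap & mask) != 0;
	if (cleared) {
		g->bitmap &= ~mask;
		g->free_inodes++;
	}
	pthread_mutex_unlock(&g->lock);
	return cleared;
}
```

In this model a second group_free_inode() call on the same bit returns 0 and leaves the counter untouched, which corresponds to the "bit already cleared" error path.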
[-- Attachment #2: 0001-xfstests-dev-add-one-more-stress-test.patch --]
[-- Type: text/plain, Size: 4375 bytes --]
From 1857fc6c7349a67cf930e73b802427a138e43456 Mon Sep 17 00:00:00 2001
From: Dmitry Monakhov <dmonakhov@openvz.org>
Date: Wed, 14 Apr 2010 14:53:47 +0400
Subject: [PATCH] xfstests-dev: add one more stress test
During stress testing we want to cover most of the code paths.
fsstress is very good for this purpose. But it is expansive by
nature (disk usage grows almost continually). So once we
hit an ENOSPC condition we will stay there till the end.
But by running 'dd' in parallel we can regularly trigger
ENOSPC, but only for limited periods of time.
This is my favorite stress test-case configuration.
---
227 | 105 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
227.out | 5 +++
group | 1 +
3 files changed, 111 insertions(+), 0 deletions(-)
create mode 100755 227
create mode 100644 227.out
diff --git a/227 b/227
new file mode 100755
index 0000000..d2b0c7d
--- /dev/null
+++ b/227
@@ -0,0 +1,105 @@
+#! /bin/bash
+# FS QA Test No. 227
+#
+# Perform fsstress test with parallel dd
+# This proven to be a good stress test
+# * Continuous dd retult in ENOSPC condition but only for a limited periods
+# of time.
+# * Fsstress test cover many code paths
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2010 Dmitry Monakhov. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#
+#-----------------------------------------------------------------------
+#
+# creator
+owner=dmonakhov@openvz.org
+
+seq=`basename $0`
+echo "QA output created by $seq"
+here=`pwd`
+tmp=/tmp/$$
+status=1 # failure is the default!
+
+_cleanup()
+{
+ rm -f $tmp.*
+}
+
+workout()
+{
+ # Disable bash job controll, to prevent message about killed task.
+ set +m
+
+ #Timing parameters
+ nr_iterations=5
+ kill_tries=20
+ echo Running fsstress. | tee -a $seq.full
+
+####################################################
+## -f unresvsp=0 -f allocsp=0 -f freesp=0 \
+## -f setxattr=0 -f attr_remove=0 -f attr_set=0 \
+##
+######################################################
+ mkdir -p $SCRATCH_MNT/fsstress
+ # It is reasonable to disable sync, otherwise most of tasks will simply
+ # stuck in that sync() call.
+ $FSSTRESS_PROG \
+ -d $SCRATCH_MNT/fsstress \
+ -p 100 -f sync=0 -n 9999999 > /dev/null 2>&1 &
+
+ echo Running ENOSPC hitters. | tee -a $seq.full
+ for ((i = 0; i < $nr_iterations; i++))
+ do
+ #Open with O_TRUNC and then write until error
+ #hit ENOSPC each time.
+ dd if=/dev/zero of=$SCRATCH_MNT/BIG_FILE bs=1M 2> /dev/null
+ done
+
+ for ((i = 0; i < $kill_tries; i++))
+ do
+ killall -r -q -TERM fsstress 2> /dev/null
+ sleep 1
+ done
+}
+
+trap "_cleanup ; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+
+rm -f $seq.full
+
+umount $TEST_DEV >/dev/null 2>&1
+umount $SCRATCH_DEV >/dev/null 2>&1
+echo "*** MKFS ***" >>$seq.full
+echo "" >>$seq.full
+_scratch_mkfs >/dev/null 2>&1 || _fail "mkfs failed"
+_scratch_mount >/dev/null 2>&1 || _fail "mount failed"
+
+workout
+umount $SCRATCH_MNT
+echo
+echo Checking filesystem
+_check_scratch_fs
+status=$?
+exit
diff --git a/227.out b/227.out
new file mode 100644
index 0000000..6a7342d
--- /dev/null
+++ b/227.out
@@ -0,0 +1,5 @@
+QA output created by 227
+Running fsstress.
+Running ENOSPC hitters.
+
+Checking filesystem
diff --git a/group b/group
index 8d4a83a..81a2aa4 100644
--- a/group
+++ b/group
@@ -339,3 +339,4 @@ deprecated
223 auto quick
224 auto
225 auto quick
+227 rw auto prealloc enospc
\ No newline at end of file
--
1.6.6
* [PATCH 1/2] ext3: fix inode bitmaps manipulation in free_inode
From: Dmitry Monakhov @ 2010-04-14 11:23 UTC (permalink / raw)
To: linux-ext4; +Cc: jack, Dmitry Monakhov
- Reorganize locking scheme to batch two atomic operations into one.
- Fix possible undefined pointer dereference.
- Even if group descriptor stats aren't accessible we have to update
inode bitmaps.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext3/ialloc.c | 62 +++++++++++++++++++++++++++--------------------------
1 files changed, 32 insertions(+), 30 deletions(-)
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index ef9008b..8352a68 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -98,7 +98,7 @@ void ext3_free_inode (handle_t *handle, struct inode * inode)
struct ext3_group_desc * gdp;
struct ext3_super_block * es;
struct ext3_sb_info *sbi;
- int fatal = 0, err;
+ int fatal = 0, err, cleared = 0;
if (atomic_read(&inode->i_count) > 1) {
printk ("ext3_free_inode: inode has count=%d\n",
@@ -150,38 +150,40 @@ void ext3_free_inode (handle_t *handle, struct inode * inode)
if (fatal)
goto error_return;
- /* Ok, now we can actually update the inode bitmaps.. */
- if (!ext3_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
- bit, bitmap_bh->b_data))
- ext3_error (sb, "ext3_free_inode",
- "bit already cleared for inode %lu", ino);
- else {
- gdp = ext3_get_group_desc (sb, block_group, &bh2);
-
+ fatal = -ESRCH;
+ gdp = ext3_get_group_desc (sb, block_group, &bh2);
+ if (gdp) {
BUFFER_TRACE(bh2, "get_write_access");
fatal = ext3_journal_get_write_access(handle, bh2);
- if (fatal) goto error_return;
-
- if (gdp) {
- spin_lock(sb_bgl_lock(sbi, block_group));
- le16_add_cpu(&gdp->bg_free_inodes_count, 1);
- if (is_directory)
- le16_add_cpu(&gdp->bg_used_dirs_count, -1);
- spin_unlock(sb_bgl_lock(sbi, block_group));
- percpu_counter_inc(&sbi->s_freeinodes_counter);
- if (is_directory)
- percpu_counter_dec(&sbi->s_dirs_counter);
-
- }
- BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
- err = ext3_journal_dirty_metadata(handle, bh2);
- if (!fatal) fatal = err;
}
- BUFFER_TRACE(bitmap_bh, "call ext3_journal_dirty_metadata");
- err = ext3_journal_dirty_metadata(handle, bitmap_bh);
- if (!fatal)
- fatal = err;
+ spin_lock(sb_bgl_lock(sbi, block_group));
+ if (fatal) {
+ /* Skip group descriptor update, update only inode bitmaps */
+ cleared = ext3_clear_bit(bit, bitmap_bh->b_data);
+ spin_unlock(sb_bgl_lock(sbi, block_group));
+ goto out;
+ }
+ /* Ok, now we can actually update the inode bitmaps.. */
+ cleared = ext3_clear_bit(bit, bitmap_bh->b_data);
+ if (!cleared) {
+ spin_unlock(sb_bgl_lock(sbi, block_group));
+ goto out;
+ }
+ le16_add_cpu(&gdp->bg_free_inodes_count, 1);
+ if (is_directory)
+ le16_add_cpu(&gdp->bg_used_dirs_count, -1);
+ spin_unlock(sb_bgl_lock(sbi, block_group));
+ percpu_counter_inc(&sbi->s_freeinodes_counter);
+ if (is_directory)
+ percpu_counter_dec(&sbi->s_dirs_counter);
+ BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
+ err = ext3_journal_dirty_metadata(handle, bh2);
+out:
+ if (cleared) {
+ BUFFER_TRACE(bitmap_bh, "call ext3_journal_dirty_metadata");
+ fatal = ext3_journal_dirty_metadata(handle, bitmap_bh);
+ }
error_return:
brelse(bitmap_bh);
ext3_std_error(sb, fatal);
--
1.6.6.1
* [PATCH 2/2] ext4: fix inode bitmaps manipulation in free_inode
From: Dmitry Monakhov @ 2010-04-14 11:23 UTC (permalink / raw)
To: linux-ext4; +Cc: jack, Dmitry Monakhov
- Reorganize locking scheme to batch two atomic operations into one.
This also allows us to state that a healthy group must obey the following rule:
ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM);
- Fix possible undefined pointer dereference.
- Even if group descriptor stats aren't accessible we have to update
inode bitmaps.
- Move updates of non-group members out of group_lock.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/ialloc.c | 91 +++++++++++++++++++++++++++--------------------------
1 files changed, 46 insertions(+), 45 deletions(-)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 57f6eef..78ceab5 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -240,59 +240,60 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
if (fatal)
goto error_return;
- /* Ok, now we can actually update the inode bitmaps.. */
- cleared = ext4_clear_bit_atomic(ext4_group_lock_ptr(sb, block_group),
- bit, bitmap_bh->b_data);
- if (!cleared)
- ext4_error(sb, "bit already cleared for inode %lu", ino);
- else {
- gdp = ext4_get_group_desc(sb, block_group, &bh2);
-
+ fatal = -ESRCH;
+ gdp = ext4_get_group_desc(sb, block_group, &bh2);
+ if (gdp) {
BUFFER_TRACE(bh2, "get_write_access");
fatal = ext4_journal_get_write_access(handle, bh2);
- if (fatal) goto error_return;
-
- if (gdp) {
- ext4_lock_group(sb, block_group);
- count = ext4_free_inodes_count(sb, gdp) + 1;
- ext4_free_inodes_set(sb, gdp, count);
- if (is_directory) {
- count = ext4_used_dirs_count(sb, gdp) - 1;
- ext4_used_dirs_set(sb, gdp, count);
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t f;
-
- f = ext4_flex_group(sbi, block_group);
- atomic_dec(&sbi->s_flex_groups[f].used_dirs);
- }
+ }
+ ext4_lock_group(sb, block_group);
+ if (fatal) {
+ /* Skip group descriptor update, update only inode bitmaps */
+ cleared = ext4_clear_bit(bit, bitmap_bh->b_data);
+ ext4_unlock_group(sb, block_group);
+ goto out;
+ }
- }
- gdp->bg_checksum = ext4_group_desc_csum(sbi,
- block_group, gdp);
- ext4_unlock_group(sb, block_group);
- percpu_counter_inc(&sbi->s_freeinodes_counter);
- if (is_directory)
- percpu_counter_dec(&sbi->s_dirs_counter);
-
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t f;
-
- f = ext4_flex_group(sbi, block_group);
- atomic_inc(&sbi->s_flex_groups[f].free_inodes);
- }
- }
- BUFFER_TRACE(bh2, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, NULL, bh2);
- if (!fatal) fatal = err;
+ /* Ok, now we can actually update the inode bitmaps.. */
+ cleared = ext4_clear_bit(bit, bitmap_bh->b_data);
+ if (!cleared) {
+ ext4_unlock_group(sb, block_group);
+ goto out;
}
- BUFFER_TRACE(bitmap_bh, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
- if (!fatal)
- fatal = err;
- sb->s_dirt = 1;
+ count = ext4_free_inodes_count(sb, gdp) + 1;
+ ext4_free_inodes_set(sb, gdp, count);
+ if (is_directory) {
+ count = ext4_used_dirs_count(sb, gdp) - 1;
+ ext4_used_dirs_set(sb, gdp, count);
+ }
+ gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
+ ext4_unlock_group(sb, block_group);
+
+ percpu_counter_inc(&sbi->s_freeinodes_counter);
+ if (is_directory)
+ percpu_counter_dec(&sbi->s_dirs_counter);
+ if (sbi->s_log_groups_per_flex) {
+ ext4_group_t f = ext4_flex_group(sbi, block_group);
+ atomic_inc(&sbi->s_flex_groups[f].free_inodes);
+ if (is_directory)
+ atomic_dec(&sbi->s_flex_groups[f].used_dirs);
+ }
+ BUFFER_TRACE(bh2, "call ext4_handle_dirty_metadata");
+ fatal = ext4_handle_dirty_metadata(handle, NULL, bh2);
+out:
+ if (cleared) {
+ BUFFER_TRACE(bitmap_bh, "call ext4_handle_dirty_metadata");
+ err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
+ if (!fatal)
+ fatal = err;
+ sb->s_dirt = 1;
+ } else
+ ext4_error(sb, "bit already cleared for inode %lu", ino);
+
error_return:
brelse(bitmap_bh);
ext4_std_error(sb, fatal);
+ return;
}
/*
--
1.6.6.1
* Re: ext34_free_inode's mess
From: Dmitry Monakhov @ 2010-04-14 11:35 UTC (permalink / raw)
To: ext4 development; +Cc: Jan Kara
Dmitry Monakhov <dmonakhov@openvz.org> writes:
> I've finally automated my favorite testcase (see attachment),
> before i've run it by hand.
> And sometimes i've saw following complain from fsck:
BTW sometimes I saw another kind of corruption:
e2fsck -fn /dev/sdb2
e2fsck 1.41.9 (22-Aug-2009)
Pass 1: Checking inodes, blocks, and sizes
Inode 69, i_blocks is 439472, should be 439480. Fix? no
...
For some unknown reason an extent node's block wasn't accounted
in i_blocks. Now I'm digging into that issue.
Currently I suspect the uninit=>init codepath.
> fsck.ext4 -f -n /dev/sdb2
> ...
> Pass 5: Checking group summary information
> Inode bitmap differences: -93582
> Fix? no
>
> Free inodes count wrong for group #12 (4634, counted=4633).
> Fix? no
>
> Free inodes count wrong (35610, counted=35609).
> Fix? no
> ...
>
> I've started to look an inode bitmap manipulation code paths
> and found strange logic in ext{3,4}_free_inode functions
>
> 1) Group lock acquired twice for bitmap and for group_desc.
> There are not any advantage from this double locking, only
> error path(where the bit is already cleared) takes an
> advantage from this locking schema.
> It is reasonable to batch it in to one locking block.
> 2) if we failed to read gdp then bh2 is undefined so
> may result in oops due to undefince pointer dereferance.
> 3) if we failed to get write_access to gdp we skip
> handle_dirty_metadata for inode_bitmap which is also a bug.
>
> I've redesigned free_inode logic(see later two emails) and
> currently i'm not able to reproduce the bug, but i can not
> guarantee it is goes away.
>
* Re: ext34_free_inode's mess
From: Jan Kara @ 2010-04-14 13:34 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: ext4 development, Jan Kara
On Wed 14-04-10 15:19:47, Dmitry Monakhov wrote:
> I've finally automated my favorite testcase (see attachment),
> before i've run it by hand.
> And sometimes i've saw following complain from fsck:
> fsck.ext4 -f -n /dev/sdb2
> ...
> Pass 5: Checking group summary information
> Inode bitmap differences: -93582
> Fix? no
>
> Free inodes count wrong for group #12 (4634, counted=4633).
> Fix? no
>
> Free inodes count wrong (35610, counted=35609).
> Fix? no
> ...
Interesting. So some inode is marked as free although it is in
use, right? That sounds like a nasty bug - if you reproduce this
again, could you use debugfs to find out what file type is that
inode? It could help looking for the bug.
> I've started to look an inode bitmap manipulation code paths
> and found strange logic in ext{3,4}_free_inode functions
>
> 1) Group lock acquired twice for bitmap and for group_desc.
> There are not any advantage from this double locking, only
> error path(where the bit is already cleared) takes an
> advantage from this locking schema.
> It is reasonable to batch it in to one locking block.
I guess you think that this happens because we pass the lock parameter
to ext3_clear_bit_atomic. But if you would actually look at the definition
of the function, you would see that it's hard to find an architecture that
uses the lock. Most architectures just use atomic bitop to clear the bit.
I actually fail to see why anyone would need the lock - probably Ted knows
:).
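As a rough user-space analogue of the lock-free case described above, an atomic test-and-clear can be written with C11 atomics. This is only a sketch: test_and_clear_bit_atomic is a hypothetical name, it operates on a single 32-bit word, and it is not the kernel's ext3_clear_bit_atomic implementation.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Atomically clear bit `nr` (0..31) in *word.
 * Returns 1 if the bit was set before the call, 0 if it was already clear.
 * No external lock is needed: the read-modify-write is a single atomic op.
 */
static int test_and_clear_bit_atomic(unsigned nr, _Atomic uint32_t *word)
{
	uint32_t mask = (uint32_t)1 << nr;
	uint32_t old = atomic_fetch_and(word, ~mask);

	return (old & mask) != 0;
}
```

The return value is what lets a caller detect the "bit already cleared" case without holding a lock around the bitmap update itself.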
> 2) if we failed to read gdp then bh2 is undefined so
> may result in oops due to undefince pointer dereferance.
No, because during mount time we check that all gdp pointers exist so
ext3_get_group_desc can never fail after the mount has succeeded.
> 3) if we failed to get write_access to gdp we skip
> handle_dirty_metadata for inode_bitmap which is also a bug.
It doesn't matter. At the moment ext3_journal_get_write_access fails we
abort the journal so no writes are allowed to the filesystem anyway. So
modified bitmap has hardly any chance to get to disk and you have to
run fsck to clean up the mess anyway...
Honza
* Re: ext34_free_inode's mess
From: Dmitry Monakhov @ 2010-04-14 14:33 UTC (permalink / raw)
To: Jan Kara; +Cc: ext4 development
Jan Kara <jack@suse.cz> writes:
> On Wed 14-04-10 15:19:47, Dmitry Monakhov wrote:
>> I've finally automated my favorite testcase (see attachment),
>> before i've run it by hand.
>> And sometimes i've saw following complain from fsck:
>> fsck.ext4 -f -n /dev/sdb2
>> ...
>> Pass 5: Checking group summary information
>> Inode bitmap differences: -93582
>> Fix? no
>>
>> Free inodes count wrong for group #12 (4634, counted=4633).
>> Fix? no
>>
>> Free inodes count wrong (35610, counted=35609).
>> Fix? no
>> ...
> Interesting. So some inode is marked as free although it is in
> use, right? That sounds like a nasty bug - if you reproduce this
> again, could you use debugfs to find out what file type is that
> inode? It could help looking for the bug.
No problem:
wget http://download.openvz.org/~dmonakhov/junk/sdb2-2.bz2
In fact I had an even better image (with only 1 free inode in a
group, but a full bitmask); unfortunately I forgot to save it.
>
>> I've started to look an inode bitmap manipulation code paths
>> and found strange logic in ext{3,4}_free_inode functions
>>
>> 1) Group lock acquired twice for bitmap and for group_desc.
>> There are not any advantage from this double locking, only
>> error path(where the bit is already cleared) takes an
>> advantage from this locking schema.
>> It is reasonable to batch it in to one locking block.
> I guess you think that this happens because we pass the lock parameter
> to ext3_clear_bit_atomic. But if you would actually look at the definition
> of the function, you would see that it's hard to find an architecture that
> uses the lock. Most architectures just use atomic bitop to clear the bit.
> I actually fail to see why anyone would need the lock - probably Ted knows
> :).
>
>> 2) if we failed to read gdp then bh2 is undefined so
>> may result in oops due to undefince pointer dereferance.
> No, because during mount time we check that all gdp pointers exist so
> ext3_get_group_desc can never fail after the mount has succeeded.
Yes, that is right. But then why do we have to check gdp for NULL?
>> 3) if we failed to get write_access to gdp we skip
>> handle_dirty_metadata for inode_bitmap which is also a bug.
> It doesn't matter. At the moment ext3_journal_get_write_access fails we
> abort the journal so no writes are allowed to the filesystem anyway. So
> modified bitmap has hardly any chance to get to disk and you have to
> run fsck to clean up the mess anyway...
>
> Honza
* Re: ext34_free_inode's mess
From: Eric Sandeen @ 2010-04-14 16:01 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: ext4 development, Jan Kara, xfs-oss
Dmitry Monakhov wrote:
> I've finally automated my favorite testcase (see attachment),
> before i've run it by hand.
Thanks! Feel free to cc: the xfs list since the patch hits
xfstests. (I added it here)
> 227 | 105 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 227.out | 5 +++
> group | 1 +
> 3 files changed, 111 insertions(+), 0 deletions(-)
> create mode 100755 227
> create mode 100644 227.out
>
> diff --git a/227 b/227
> new file mode 100755
> index 0000000..d2b0c7d
> --- /dev/null
> +++ b/227
> @@ -0,0 +1,105 @@
> +#! /bin/bash
> +# FS QA Test No. 227
> +#
> +# Perform fsstress test with parallel dd
> +# This proven to be a good stress test
> +# * Continuous dd retult in ENOSPC condition but only for a limited periods
> +# of time.
> +# * Fsstress test cover many code paths
just little editor nitpicks:
+# Perform fsstress test with parallel dd
+# This is proven to be a good stress test
+# * Continuous dd results in ENOSPC condition but only for a limited period
+# of time.
+# * Fsstress test covers many code paths
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (c) 2010 Dmitry Monakhov. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#
> +#-----------------------------------------------------------------------
> +#
> +# creator
> +owner=dmonakhov@openvz.org
> +
> +seq=`basename $0`
> +echo "QA output created by $seq"
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +
> +_cleanup()
> +{
> + rm -f $tmp.*
> +}
> +
> +workout()
> +{
> + # Disable bash job controll, to prevent message about killed task.
s/controll/control/
> + set +m
> +
> + #Timing parameters
> + nr_iterations=5
> + kill_tries=20
> + echo Running fsstress. | tee -a $seq.full
> +
> +####################################################
What is all this for?
FWIW other fsstress tests use an $FSSTRESS_AVOID variable,
where you can set the things you want to avoid easily
> +## -f unresvsp=0 -f allocsp=0 -f freesp=0 \
> +## -f setxattr=0 -f attr_remove=0 -f attr_set=0 \
> +##
> +######################################################
> + mkdir -p $SCRATCH_MNT/fsstress
> + # It is reasonable to disable sync, otherwise most of tasks will simply
> + # stuck in that sync() call.
> + $FSSTRESS_PROG \
> + -d $SCRATCH_MNT/fsstress \
> + -p 100 -f sync=0 -n 9999999 > /dev/null 2>&1 &
> +
> + echo Running ENOSPC hitters. | tee -a $seq.full
> + for ((i = 0; i < $nr_iterations; i++))
> + do
> + #Open with O_TRUNC and then write until error
> + #hit ENOSPC each time.
> + dd if=/dev/zero of=$SCRATCH_MNT/BIG_FILE bs=1M 2> /dev/null
> + done
> +
> + for ((i = 0; i < $kill_tries; i++))
> + do
> + killall -r -q -TERM fsstress 2> /dev/null
> + sleep 1
> + done
> +}
> +
> +trap "_cleanup ; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ./common.rc
> +. ./common.filter
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +
> +rm -f $seq.full
> +
> +umount $TEST_DEV >/dev/null 2>&1
> +umount $SCRATCH_DEV >/dev/null 2>&1
> +echo "*** MKFS ***" >>$seq.full
> +echo "" >>$seq.full
> +_scratch_mkfs >/dev/null 2>&1 || _fail "mkfs failed"
> +_scratch_mount >/dev/null 2>&1 || _fail "mount failed"
> +
> +workout
> +umount $SCRATCH_MNT
> +echo
> +echo Checking filesystem
> +_check_scratch_fs
> +status=$?
> +exit
> diff --git a/227.out b/227.out
> new file mode 100644
> index 0000000..6a7342d
> --- /dev/null
> +++ b/227.out
> @@ -0,0 +1,5 @@
> +QA output created by 227
> +Running fsstress.
> +Running ENOSPC hitters.
> +
> +Checking filesystem
> diff --git a/group b/group
> index 8d4a83a..81a2aa4 100644
> --- a/group
> +++ b/group
> @@ -339,3 +339,4 @@ deprecated
> 223 auto quick
> 224 auto
> 225 auto quick
> +227 rw auto prealloc enospc
Is this prealloc just because fsstress may run resvsp?
FWIW, other fsstress tests aren't in that group, so this is
a little inconsistent.
Thanks for writing an xfstests patch! :)
-Eric
* Re: ext34_free_inode's mess
From: Eric Sandeen @ 2010-04-14 16:03 UTC (permalink / raw)
To: Jan Kara; +Cc: Dmitry Monakhov, ext4 development
Jan Kara wrote:
> On Wed 14-04-10 15:19:47, Dmitry Monakhov wrote:
>> I've finally automated my favorite testcase (see attachment),
>> before i've run it by hand.
>> And sometimes i've saw following complain from fsck:
>> fsck.ext4 -f -n /dev/sdb2
>> ...
>> Pass 5: Checking group summary information
>> Inode bitmap differences: -93582
>> Fix? no
>>
>> Free inodes count wrong for group #12 (4634, counted=4633).
>> Fix? no
>>
>> Free inodes count wrong (35610, counted=35609).
>> Fix? no
>> ...
> Interesting. So some inode is marked as free although it is in
> use, right? That sounds like a nasty bug - if you reproduce this
> again, could you use debugfs to find out what file type is that
> inode? It could help looking for the bug.
By running fsstress in verbose mode and disabling link/unlink/symlink,
you can sometimes narrow it down to a sequence of operations on that file, too.
(Keep track of the seed number...)
Of course if it's a random-ish race that probably won't be of much use. :)
-Eric
* Re: ext34_free_inode's mess
2010-04-14 16:01 ` Eric Sandeen
@ 2010-04-14 16:56 ` Dmitry Monakhov
2010-04-14 23:47 ` Dave Chinner
1 sibling, 0 replies; 16+ messages in thread
From: Dmitry Monakhov @ 2010-04-14 16:56 UTC (permalink / raw)
To: Eric Sandeen; +Cc: ext4 development, Jan Kara, xfs-oss
[-- Attachment #1: Type: text/plain, Size: 5870 bytes --]
Eric Sandeen <sandeen@redhat.com> writes:
> Dmitry Monakhov wrote:
>> I've finally automated my favorite testcase (see attachment),
>> before i've run it by hand.
>
> Thanks! Feel free to cc: the xfs list since the patch hits
> xfstests. (I added it here)
>
>> 227 | 105 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 227.out | 5 +++
>> group | 1 +
>> 3 files changed, 111 insertions(+), 0 deletions(-)
>> create mode 100755 227
>> create mode 100644 227.out
>>
>> diff --git a/227 b/227
>> new file mode 100755
>> index 0000000..d2b0c7d
>> --- /dev/null
>> +++ b/227
>> @@ -0,0 +1,105 @@
>> +#! /bin/bash
>> +# FS QA Test No. 227
>> +#
>> +# Perform fsstress test with parallel dd
>> +# This proven to be a good stress test
>> +# * Continuous dd retult in ENOSPC condition but only for a limited periods
>> +# of time.
>> +# * Fsstress test cover many code paths
>
> just little editor nitpicks:
>
> +# Perform fsstress test with parallel dd
> +# This is proven to be a good stress test
> +# * Continuous dd results in ENOSPC condition but only for a limited period
> +# of time.
> +# * Fsstress test covers many code paths
>
>
>> +#
>> +#-----------------------------------------------------------------------
>> +# Copyright (c) 2010 Dmitry Monakhov. All Rights Reserved.
>> +#
>> +# This program is free software; you can redistribute it and/or
>> +# modify it under the terms of the GNU General Public License as
>> +# published by the Free Software Foundation.
>> +#
>> +# This program is distributed in the hope that it would be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program; if not, write the Free Software Foundation,
>> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
>> +#
>> +#-----------------------------------------------------------------------
>> +#
>> +# creator
>> +owner=dmonakhov@openvz.org
>> +
>> +seq=`basename $0`
>> +echo "QA output created by $seq"
>> +here=`pwd`
>> +tmp=/tmp/$$
>> +status=1 # failure is the default!
>> +
>> +_cleanup()
>> +{
>> + rm -f $tmp.*
>> +}
>> +
>> +workout()
>> +{
>> + # Disable bash job controll, to prevent message about killed task.
>
> s/controll/control/
Ok, will redo and submit it one more time.
>
>> + set +m
>> +
>> + #Timing parameters
>> + nr_iterations=5
>> + kill_tries=20
>> + echo Running fsstress. | tee -a $seq.full
>> +
>> +####################################################
>
> What is all this for?
>
> FWIW other fsstress tests use an $FSSTRESS_AVOID variable,
> where you can set the things you want to avoid easily
I added this while investigating the uninit=>init extent bug
and forgot to remove it.
>
>> +## -f unresvsp=0 -f allocsp=0 -f freesp=0 \
>> +## -f setxattr=0 -f attr_remove=0 -f attr_set=0 \
>> +##
>> +######################################################
>> + mkdir -p $SCRATCH_MNT/fsstress
>> + # It is reasonable to disable sync, otherwise most of tasks will simply
>> + # stuck in that sync() call.
>> + $FSSTRESS_PROG \
>> + -d $SCRATCH_MNT/fsstress \
>> + -p 100 -f sync=0 -n 9999999 > /dev/null 2>&1 &
>> +
>> + echo Running ENOSPC hitters. | tee -a $seq.full
>> + for ((i = 0; i < $nr_iterations; i++))
>> + do
>> + #Open with O_TRUNC and then write until error
>> + #hit ENOSPC each time.
>> + dd if=/dev/zero of=$SCRATCH_MNT/BIG_FILE bs=1M 2> /dev/null
>> + done
>> +
>> + for ((i = 0; i < $kill_tries; i++))
>> + do
>> + killall -r -q -TERM fsstress 2> /dev/null
>> + sleep 1
>> + done
>> +}
>> +
>> +trap "_cleanup ; exit \$status" 0 1 2 3 15
>> +
>> +# get standard environment, filters and checks
>> +. ./common.rc
>> +. ./common.filter
>> +
>> +# real QA test starts here
>> +_supported_fs generic
>> +_supported_os Linux
>> +_require_scratch
>> +
>> +rm -f $seq.full
>> +
>> +umount $TEST_DEV >/dev/null 2>&1
>> +umount $SCRATCH_DEV >/dev/null 2>&1
>> +echo "*** MKFS ***" >>$seq.full
>> +echo "" >>$seq.full
>> +_scratch_mkfs >/dev/null 2>&1 || _fail "mkfs failed"
>> +_scratch_mount >/dev/null 2>&1 || _fail "mount failed"
>> +
>> +workout
>> +umount $SCRATCH_MNT
>> +echo
>> +echo Checking filesystem
>> +_check_scratch_fs
>> +status=$?
>> +exit
>> diff --git a/227.out b/227.out
>> new file mode 100644
>> index 0000000..6a7342d
>> --- /dev/null
>> +++ b/227.out
>> @@ -0,0 +1,5 @@
>> +QA output created by 227
>> +Running fsstress.
>> +Running ENOSPC hitters.
>> +
>> +Checking filesystem
>> diff --git a/group b/group
>> index 8d4a83a..81a2aa4 100644
>> --- a/group
>> +++ b/group
>> @@ -339,3 +339,4 @@ deprecated
>> 223 auto quick
>> 224 auto
>> 225 auto quick
>> +227 rw auto prealloc enospc
>
> Is this prealloc just because fsstress may run resvsp?
> FWIW, other fsstress tests aren't in that group, so this is
> as little inconsistent.
Oh, I missed that.
BTW, I've hit one more bug (a NULL pointer dereference).
I'm able to reproduce it only on an 8-core host with HT;
see the attachment for more info.
It seems to trigger code which was never triggered before:
fs/ext4/extents.c
3477:   if (unlikely(EXT4_I(inode)->i_flags & EXT4_EOFBLOCKS_FL)) {
                if (unlikely(!eh->eh_entries)) {
                        EXT4_ERROR_INODE(inode,
                                "eh->eh_entries == 0 ee_block %d",
                                ex->ee_block);
######## OOPS here because ex == NULL.  ^^^^^^^^^^^^
                        err = -EIO;
                        goto out2;
                }
Continue digging...
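
The failure mode described above, where the error report itself dereferences the pointer that is NULL on the path being reported, can be illustrated in user space. This is only a minimal sketch and not the kernel fix: `struct extent`, `struct extent_header` and `check_eofblocks()` are hypothetical stand-ins for the ext4 types, and the fallback message is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the ext4 extent structures. */
struct extent { uint32_t ee_block; };
struct extent_header { uint16_t eh_entries; };

/*
 * Returns 0 when the header has entries and -EIO otherwise.
 * The report is written to msg. ex is dereferenced only when it is
 * non-NULL, which is the guard missing from the snippet above.
 */
static int check_eofblocks(const struct extent_header *eh,
                           const struct extent *ex,
                           char *msg, size_t len)
{
    if (eh->eh_entries != 0) {
        if (len)
            msg[0] = '\0';
        return 0;
    }
    if (ex)
        snprintf(msg, len, "eh->eh_entries == 0 ee_block %u",
                 (unsigned)ex->ee_block);
    else
        snprintf(msg, len, "eh->eh_entries == 0, no extent available");
    return -EIO;
}
```

Calling `check_eofblocks()` with an empty header and a NULL extent returns `-EIO` with the fallback message instead of faulting.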
[-- Attachment #2: oops-1.tag.gz --]
[-- Type: application/octet-stream, Size: 124244 bytes --]
[-- Attachment #3: Type: text/plain, Size: 121 bytes --]
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: ext34_free_inode's mess
2010-04-14 16:01 ` Eric Sandeen
2010-04-14 16:56 ` Dmitry Monakhov
@ 2010-04-14 23:47 ` Dave Chinner
1 sibling, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2010-04-14 23:47 UTC (permalink / raw)
To: Eric Sandeen; +Cc: ext4 development, Dmitry Monakhov, Jan Kara, xfs-oss
On Wed, Apr 14, 2010 at 11:01:16AM -0500, Eric Sandeen wrote:
> Dmitry Monakhov wrote:
> > I've finally automated my favorite testcase (see attachment),
> > before i've run it by hand.
>
> Thanks! Feel free to cc: the xfs list since the patch hits
> xfstests. (I added it here)
>
> > 227 | 105 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 227.out | 5 +++
> > group | 1 +
> > 3 files changed, 111 insertions(+), 0 deletions(-)
> > create mode 100755 227
> > create mode 100644 227.out
> >
> > diff --git a/227 b/227
> > new file mode 100755
> > index 0000000..d2b0c7d
> > --- /dev/null
> > +++ b/227
> > @@ -0,0 +1,105 @@
> > +#! /bin/bash
> > +# FS QA Test No. 227
> > +#
> > +# Perform fsstress test with parallel dd
> > +# This proven to be a good stress test
> > +# * Continuous dd retult in ENOSPC condition but only for a limited periods
> > +# of time.
> > +# * Fsstress test cover many code paths
>
> just little editor nitpicks:
>
> +# Perform fsstress test with parallel dd
> +# This is proven to be a good stress test
> +# * Continuous dd results in ENOSPC condition but only for a limited period
> +# of time.
> +# * Fsstress test covers many code paths
This is close to the same as test 083:
# Exercise filesystem full behaviour - run numerous fsstress
# processes in write mode on a small filesystem. NB: delayed
# allocate flushing is quite deadlock prone at the filesystem
# full boundary due to the fact that we will retry allocation
# several times after flushing, before giving back ENOSPC.
That test is not really doing anything XFS specific,
so could easily be modified to run on generic filesystems...
> > +
> > + #Timing parameters
> > + nr_iterations=5
> > + kill_tries=20
> > + echo Running fsstress. | tee -a $seq.full
> > +
> > +####################################################
>
> What is all this for?
>
> FWIW other fsstress tests use an $FSSTRESS_AVOID variable,
> where you can set the things you want to avoid easily
>
> > +## -f unresvsp=0 -f allocsp=0 -f freesp=0 \
> > +## -f setxattr=0 -f attr_remove=0 -f attr_set=0 \
> > +##
> > +######################################################
> > + mkdir -p $SCRATCH_MNT/fsstress
> > + # It is reasonable to disable sync, otherwise most of tasks will simply
> > + # stuck in that sync() call.
> > + $FSSTRESS_PROG \
> > + -d $SCRATCH_MNT/fsstress \
> > + -p 100 -f sync=0 -n 9999999 > /dev/null 2>&1 &
> > +
> > + echo Running ENOSPC hitters. | tee -a $seq.full
> > + for ((i = 0; i < $nr_iterations; i++))
> > + do
> > + #Open with O_TRUNC and then write until error
> > + #hit ENOSPC each time.
> > + dd if=/dev/zero of=$SCRATCH_MNT/BIG_FILE bs=1M 2> /dev/null
> > + done
OK, so on a 10GB scratch device, this is going to write 50GB of
data, which at 100MB/s is going to take roughly 10 minutes.
The test should use a limited-size filesystem (_scratch_mkfs_sized)
to limit the runtime...
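
The runtime estimate above is simple arithmetic and can be written out as a small helper. This is a sketch under stated assumptions: `fill_runtime_seconds()` is a hypothetical name, units are decimal (1 GB = 10^9 bytes), and the write rate is treated as constant.

```c
#include <stdint.h>

/*
 * Seconds needed to fill a device of device_bytes bytes `iterations`
 * times at a sustained rate of bytes_per_sec. All inputs are
 * assumptions supplied by the caller; integer division truncates.
 */
static uint64_t fill_runtime_seconds(uint64_t device_bytes,
                                     unsigned iterations,
                                     uint64_t bytes_per_sec)
{
    return device_bytes * iterations / bytes_per_sec;
}
```

With a 10 GB device, 5 iterations and 100 MB/s this gives 500 seconds, a little over 8 minutes, which matches the rough figure quoted. A hypothetical 500 MB scratch filesystem would bring the same loop down to 25 seconds.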
FWIW, test 083 spends most of its runtime at or near ENOSPC, so
once again I wonder if that is not a better test to be using...
> > +workout
> > +umount $SCRATCH_MNT
> > +echo
> > +echo Checking filesystem
> > +_check_scratch_fs
You don't need to check the scratch fs in the test - that is done by
the test harness after the test completes.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/2] ext4: fix inode bitmaps manipulation in free_inode
2010-04-14 11:23 ` [PATCH 2/2] ext4: " Dmitry Monakhov
@ 2010-04-15 0:12 ` tytso
2010-04-16 1:06 ` tytso
0 siblings, 1 reply; 16+ messages in thread
From: tytso @ 2010-04-15 0:12 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, jack
This is what I dropped into the ext4 patch queue. It fixes up some
spelling errors and makes a few other minor changes.
- Ted
ext4: clean up inode bitmaps manipulation in ext4_free_inode
From: Dmitry Monakhov <dmonakhov@openvz.org>
- Reorganize locking scheme to batch two atomic operations into one.
This also allows us to state that a healthy group must obey the following rule:
ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM);
- Fix possible undefined pointer dereference.
- Even if group descriptor stats aren't accessible we have to update
inode bitmaps.
- Move non-group members update out of group_lock.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/ialloc.c | 88 +++++++++++++++++++++++++++---------------------------
1 files changed, 44 insertions(+), 44 deletions(-)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 57f6eef..25fe42f 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -240,56 +240,56 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
if (fatal)
goto error_return;
- /* Ok, now we can actually update the inode bitmaps.. */
- cleared = ext4_clear_bit_atomic(ext4_group_lock_ptr(sb, block_group),
- bit, bitmap_bh->b_data);
- if (!cleared)
- ext4_error(sb, "bit already cleared for inode %lu", ino);
- else {
- gdp = ext4_get_group_desc(sb, block_group, &bh2);
-
+ fatal = -ESRCH;
+ gdp = ext4_get_group_desc(sb, block_group, &bh2);
+ if (gdp) {
BUFFER_TRACE(bh2, "get_write_access");
fatal = ext4_journal_get_write_access(handle, bh2);
- if (fatal) goto error_return;
-
- if (gdp) {
- ext4_lock_group(sb, block_group);
- count = ext4_free_inodes_count(sb, gdp) + 1;
- ext4_free_inodes_set(sb, gdp, count);
- if (is_directory) {
- count = ext4_used_dirs_count(sb, gdp) - 1;
- ext4_used_dirs_set(sb, gdp, count);
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t f;
-
- f = ext4_flex_group(sbi, block_group);
- atomic_dec(&sbi->s_flex_groups[f].used_dirs);
- }
+ }
+ ext4_lock_group(sb, block_group);
+ if (fatal) {
+ /* Skip group descriptor update, update only inode bitmaps */
+ cleared = ext4_clear_bit(bit, bitmap_bh->b_data);
+ ext4_unlock_group(sb, block_group);
+ goto out;
+ }
- }
- gdp->bg_checksum = ext4_group_desc_csum(sbi,
- block_group, gdp);
- ext4_unlock_group(sb, block_group);
- percpu_counter_inc(&sbi->s_freeinodes_counter);
- if (is_directory)
- percpu_counter_dec(&sbi->s_dirs_counter);
-
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t f;
-
- f = ext4_flex_group(sbi, block_group);
- atomic_inc(&sbi->s_flex_groups[f].free_inodes);
- }
- }
- BUFFER_TRACE(bh2, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, NULL, bh2);
- if (!fatal) fatal = err;
+ /* Ok, now we can actually update the inode bitmaps.. */
+ cleared = ext4_clear_bit(bit, bitmap_bh->b_data);
+ if (!cleared) {
+ ext4_unlock_group(sb, block_group);
+ goto out;
}
- BUFFER_TRACE(bitmap_bh, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
- if (!fatal)
- fatal = err;
- sb->s_dirt = 1;
+ count = ext4_free_inodes_count(sb, gdp) + 1;
+ ext4_free_inodes_set(sb, gdp, count);
+ if (is_directory) {
+ count = ext4_used_dirs_count(sb, gdp) - 1;
+ ext4_used_dirs_set(sb, gdp, count);
+ percpu_counter_dec(&sbi->s_dirs_counter);
+ }
+ gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
+ ext4_unlock_group(sb, block_group);
+
+ percpu_counter_inc(&sbi->s_freeinodes_counter);
+ if (sbi->s_log_groups_per_flex) {
+ ext4_group_t f = ext4_flex_group(sbi, block_group);
+
+ atomic_inc(&sbi->s_flex_groups[f].free_inodes);
+ if (is_directory)
+ atomic_dec(&sbi->s_flex_groups[f].used_dirs);
+ }
+ BUFFER_TRACE(bh2, "call ext4_handle_dirty_metadata");
+ fatal = ext4_handle_dirty_metadata(handle, NULL, bh2);
+out:
+ if (cleared) {
+ BUFFER_TRACE(bitmap_bh, "call ext4_handle_dirty_metadata");
+ err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
+ if (!fatal)
+ fatal = err;
+ sb->s_dirt = 1;
+ } else
+ ext4_error(sb, "bit already cleared for inode %lu", ino);
+
error_return:
brelse(bitmap_bh);
ext4_std_error(sb, fatal);
* Re: ext34_free_inode's mess
2010-04-14 14:33 ` Dmitry Monakhov
@ 2010-04-15 21:39 ` Jan Kara
2010-04-15 22:01 ` Dmitry Monakhov
0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2010-04-15 21:39 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: Jan Kara, ext4 development
On Wed 14-04-10 18:33:30, Dmitry Monakhov wrote:
> Jan Kara <jack@suse.cz> writes:
>
> > On Wed 14-04-10 15:19:47, Dmitry Monakhov wrote:
> >> I've finally automated my favorite testcase (see attachment),
> >> before i've run it by hand.
> >> And sometimes i've saw following complain from fsck:
> >> fsck.ext4 -f -n /dev/sdb2
> >> ...
> >> Pass 5: Checking group summary information
> >> Inode bitmap differences: -93582
> >> Fix? no
> >>
> >> Free inodes count wrong for group #12 (4634, counted=4633).
> >> Fix? no
> >>
> >> Free inodes count wrong (35610, counted=35609).
> >> Fix? no
> >> ...
> > Interesting. So some inode is marked as free although it is in
> > use, right? That sounds like a nasty bug - if you reproduce this
> > again, could you use debugfs to find out what file type is that
> > inode? It could help looking for the bug.
> No problems,
> wget http://download.openvz.org/~dmonakhov/junk/sdb2-2.bz2
> In fact i've had even better image (with only 1 free inode in a
> group, but full bitmask) unfortunately i forgot to save it.
I've looked at it: So the problem is the other way around (I always
confuse this). The inode is properly deleted but the bit remains set
in the bitmap. What is strange is that group descriptor counts are
correct so it's really only the bitmap bit that is wrong. I've looked
through the inode allocation and freeing code back and forth but I could
not find a place where this could realistically happen.
So just for the record:
Inode has mtime = ctime = atime = dtime (so it was really deleted), i_nlink
= 0, it is a directory, i_disksize = 4096, i_blocks = 0. So indeed it looks
like we were in ext4_mkdir, we failed to allocate the block for the directory
and went to out_clear_inode (thus i_disksize remained set to 4096,
otherwise it would be set to 0)... But how it happened that the bit in the
bitmap didn't get cleared while the group descriptors were updated is
beyond me.
Alternatively the inode could have been deleted just fine and later we
just set the bit in the inode bitmap and didn't update anything else. But
even this does not seem to be possible to me...
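
The mismatch discussed here, a stale bit in the inode bitmap while the descriptor's free count is correct, is the kind of inconsistency a bitmap-versus-counter comparison detects. The following is a user-space sketch of such a comparison, not e2fsck or kernel code: `count_free_bits()` and `group_is_consistent()` are hypothetical helpers, and a set bit is assumed to mean "inode in use".

```c
#include <stdint.h>

/* Count clear (free) bits among the first nbits bits of bitmap. */
static unsigned count_free_bits(const uint8_t *bitmap, unsigned nbits)
{
    unsigned i, free_bits = 0;

    for (i = 0; i < nbits; i++)
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            free_bits++;
    return free_bits;
}

/* Returns 1 when the stored free count agrees with the bitmap. */
static int group_is_consistent(const uint8_t *bitmap, unsigned nbits,
                               unsigned stored_free)
{
    return count_free_bits(bitmap, nbits) == stored_free;
}
```

For a 16-inode group whose descriptor says 4 inodes are free, a bitmap with one extra bit left set counts only 3 free inodes, so the check fails in the same way as the "4634, counted=4633" report.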
Since you can reproduce it, good first step would be to
> >> I've started to look an inode bitmap manipulation code paths
> >> and found strange logic in ext{3,4}_free_inode functions
> >>
> >> 1) Group lock acquired twice for bitmap and for group_desc.
> >> There are not any advantage from this double locking, only
> >> error path(where the bit is already cleared) takes an
> >> advantage from this locking schema.
> >> It is reasonable to batch it in to one locking block.
> > I guess you think that this happens because we pass the lock parameter
> > to ext3_clear_bit_atomic. But if you would actually look at the definition
> > of the function, you would see that it's hard to find an architecture that
> > uses the lock. Most architectures just use atomic bitop to clear the bit.
> > I actually fail to see why anyone would need the lock - probably Ted knows
> > :).
> >
> >> 2) if we failed to read gdp then bh2 is undefined so
> >> may result in oops due to undefince pointer dereferance.
> > No, because during mount time we check that all gdp pointers exist so
> > ext3_get_group_desc can never fail after the mount has succeeded.
> Yes, that is right, why we have to check gdp to NULL when?
Hmm, I've looked at the code again and I think the check is there mainly
to avoid Oops in case filesystem got corrupted and we computed some bogus
group number. Not that I would see how that could happen in this particular
case but in some other uses of ext3_get_group_desc it could happen. So
moving the gdp check before we use bh2 probably makes some sense (although
it's probably just a style cleanup in this case).
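
The ordering issue can be shown with a lookup that fills its out-parameter only on success. This is a minimal user-space sketch; `get_group_desc()`, `bump_free_inodes()`, `NGROUPS` and both structs are hypothetical stand-ins and only loosely model ext3_get_group_desc.

```c
#include <stddef.h>

#define NGROUPS 4

struct buffer { int dirty; };
struct group_desc { unsigned free_inodes; };

static struct group_desc descs[NGROUPS];
static struct buffer bufs[NGROUPS];

/* Sets *bh only on success; on failure *bh is left untouched. */
static struct group_desc *get_group_desc(unsigned group, struct buffer **bh)
{
    if (group >= NGROUPS)
        return NULL;
    *bh = &bufs[group];
    return &descs[group];
}

/* Checks the descriptor before touching bh. Returns 0 or -1. */
static int bump_free_inodes(unsigned group)
{
    struct buffer *bh = NULL;
    struct group_desc *gdp = get_group_desc(group, &bh);

    if (!gdp)
        return -1;      /* bh was never set, so it must not be used */
    gdp->free_inodes++;
    bh->dirty = 1;
    return 0;
}
```

A bogus group number makes `bump_free_inodes()` return -1 without ever dereferencing `bh`.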
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: ext34_free_inode's mess
2010-04-15 21:39 ` Jan Kara
@ 2010-04-15 22:01 ` Dmitry Monakhov
2010-04-16 13:33 ` tytso
0 siblings, 1 reply; 16+ messages in thread
From: Dmitry Monakhov @ 2010-04-15 22:01 UTC (permalink / raw)
To: Jan Kara; +Cc: ext4 development
Jan Kara <jack@suse.cz> writes:
> On Wed 14-04-10 18:33:30, Dmitry Monakhov wrote:
>> Jan Kara <jack@suse.cz> writes:
>>
>> > On Wed 14-04-10 15:19:47, Dmitry Monakhov wrote:
>> >> I've finally automated my favorite testcase (see attachment),
>> >> before i've run it by hand.
>> >> And sometimes i've saw following complain from fsck:
>> >> fsck.ext4 -f -n /dev/sdb2
>> >> ...
>> >> Pass 5: Checking group summary information
>> >> Inode bitmap differences: -93582
>> >> Fix? no
>> >>
>> >> Free inodes count wrong for group #12 (4634, counted=4633).
>> >> Fix? no
>> >>
>> >> Free inodes count wrong (35610, counted=35609).
>> >> Fix? no
>> >> ...
>> > Interesting. So some inode is marked as free although it is in
>> > use, right? That sounds like a nasty bug - if you reproduce this
>> > again, could you use debugfs to find out what file type is that
>> > inode? It could help looking for the bug.
>> No problems,
>> wget http://download.openvz.org/~dmonakhov/junk/sdb2-2.bz2
>> In fact i've had even better image (with only 1 free inode in a
>> group, but full bitmask) unfortunately i forgot to save it.
> I've looked at it: So the problem is the other way around (I always
> confuse this). The inode is properly deleted but the bit remains set
> in the bitmap. What is strange is that group descriptor counts are
> correct so it's really only the bitmap bit that is wrong. I've looked
> through the inode allocation and freeing code back and forth but I could
> not find a place where this could realistically happen.
> So just for record:
> Inode has mtime = ctime = atime = dtime (so it was really deleted), i_nlink
> = 0, it is a directory, i_disksize = 4096, i_blocks = 0. So indeed it looks
> that we were in ext4_mkdir, we failed to allocate the block for directory
> and went to out_clear_inode (thus i_disksize remained to be set to 4096,
> otherwise it would be set to 0)... But how it happened that the bit in the
> bitmap didn't get cleared while the group descriptors were updated is
> beyond me.
> Alternatively the inode could have been deleted just fine and later we
> just set the bit in the inode bitmap and didn't update anything else. But
> even this does not seem to be possible to me...
> Since you can reproduce it, good first step would be to
I will, but for now I'm working on a fix for the OOPS
from fs/ext4/extents.c:3479 due to ex == NULL.
I'll create a new bug in bugzilla for this in a minute.
>
>> >> I've started to look an inode bitmap manipulation code paths
>> >> and found strange logic in ext{3,4}_free_inode functions
>> >>
>> >> 1) Group lock acquired twice for bitmap and for group_desc.
>> >> There are not any advantage from this double locking, only
>> >> error path(where the bit is already cleared) takes an
>> >> advantage from this locking schema.
>> >> It is reasonable to batch it in to one locking block.
>> > I guess you think that this happens because we pass the lock parameter
>> > to ext3_clear_bit_atomic. But if you would actually look at the definition
>> > of the function, you would see that it's hard to find an architecture that
>> > uses the lock. Most architectures just use atomic bitop to clear the bit.
>> > I actually fail to see why anyone would need the lock - probably Ted knows
>> > :).
>> >
>> >> 2) if we failed to read gdp then bh2 is undefined so
>> >> may result in oops due to undefince pointer dereferance.
>> > No, because during mount time we check that all gdp pointers exist so
>> > ext3_get_group_desc can never fail after the mount has succeeded.
>> Yes, that is right, why we have to check gdp to NULL when?
> Hmm, I've looked at the code again and I think the check is there mainly
> to avoid Oops in case filesystem got corrupted and we computed some bogus
> group number. Not that I would see how that could happen in this particular
> case but in some other uses of ext3_get_group_desc it could happen. So
> moving the gdp check before we use bh2 probably makes some sence (although
> it's probably just a style cleanup in this case).
OK, if we know that any error results in EIO or a panic, then let's just
call it a style cleanup (simplification); IMHO the new code is more readable.
* Re: [PATCH 2/2] ext4: fix inode bitmaps manipulation in free_inode
2010-04-15 0:12 ` tytso
@ 2010-04-16 1:06 ` tytso
2010-04-17 10:57 ` Dmitry Monakhov
0 siblings, 1 reply; 16+ messages in thread
From: tytso @ 2010-04-16 1:06 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, jack
Here's my -V3 respin of this patch, which further cleans up the code
and removes some duplicated code by only calling ext4_clear_bit() from
one call site.
I think I'm about done with this, so if you agree with my improvements
as improvements, it might be useful to port this back to the ext3 version
of this patch.
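
The control flow of the single call site can be modeled in user space as follows. This is only a sketch and not the kernel code: `struct group`, `test_and_clear_bit()` and `sketch_free_inode()` are hypothetical, the `locked` flag merely marks where the group lock would be held, and returning -1 for an already-clear bit is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

struct group {
    uint8_t bitmap[16];
    unsigned free_inodes;
    bool locked;            /* marks the region the group lock covers */
};

/* Clears the bit and reports whether it was previously set. */
static bool test_and_clear_bit(unsigned bit, uint8_t *map)
{
    uint8_t mask = (uint8_t)(1u << (bit % 8));
    bool was_set = (map[bit / 8] & mask) != 0;

    map[bit / 8] &= (uint8_t)~mask;
    return was_set;
}

/*
 * desc_err models a failure to get write access to the descriptor.
 * The bitmap bit is cleared exactly once, inside the locked region;
 * the counter is updated only if desc_err == 0 and the bit was set.
 */
static int sketch_free_inode(struct group *g, unsigned bit, int desc_err)
{
    bool cleared;

    g->locked = true;
    cleared = test_and_clear_bit(bit, g->bitmap);
    if (!desc_err && cleared)
        g->free_inodes++;
    g->locked = false;

    if (!cleared)
        return -1;          /* bit already cleared */
    return desc_err;
}
```

Because the bit and the counter change inside one locked region, a reader that takes the same lock never sees the bit cleared without the count updated, except on the descriptor-error path, where only the bitmap is updated.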
- Ted
ext4: clean up inode bitmaps manipulation in ext4_free_inode
From: Dmitry Monakhov <dmonakhov@openvz.org>
- Reorganize locking scheme to batch two atomic operations into one.
This also allows us to state that a healthy group must obey the following rule:
ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM);
- Fix possible undefined pointer dereference.
- Even if group descriptor stats aren't accessible we have to update
inode bitmaps.
- Move non-group members update out of group_lock.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/ialloc.c | 81 ++++++++++++++++++++++++-----------------------------
1 files changed, 37 insertions(+), 44 deletions(-)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 57f6eef..52618d5 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -240,56 +240,49 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
if (fatal)
goto error_return;
- /* Ok, now we can actually update the inode bitmaps.. */
- cleared = ext4_clear_bit_atomic(ext4_group_lock_ptr(sb, block_group),
- bit, bitmap_bh->b_data);
- if (!cleared)
- ext4_error(sb, "bit already cleared for inode %lu", ino);
- else {
- gdp = ext4_get_group_desc(sb, block_group, &bh2);
-
+ fatal = -ESRCH;
+ gdp = ext4_get_group_desc(sb, block_group, &bh2);
+ if (gdp) {
BUFFER_TRACE(bh2, "get_write_access");
fatal = ext4_journal_get_write_access(handle, bh2);
- if (fatal) goto error_return;
-
- if (gdp) {
- ext4_lock_group(sb, block_group);
- count = ext4_free_inodes_count(sb, gdp) + 1;
- ext4_free_inodes_set(sb, gdp, count);
- if (is_directory) {
- count = ext4_used_dirs_count(sb, gdp) - 1;
- ext4_used_dirs_set(sb, gdp, count);
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t f;
-
- f = ext4_flex_group(sbi, block_group);
- atomic_dec(&sbi->s_flex_groups[f].used_dirs);
- }
+ }
+ ext4_lock_group(sb, block_group);
+ cleared = ext4_clear_bit(bit, bitmap_bh->b_data);
+ if (fatal || !cleared) {
+ ext4_unlock_group(sb, block_group);
+ goto out;
+ }
- }
- gdp->bg_checksum = ext4_group_desc_csum(sbi,
- block_group, gdp);
- ext4_unlock_group(sb, block_group);
- percpu_counter_inc(&sbi->s_freeinodes_counter);
- if (is_directory)
- percpu_counter_dec(&sbi->s_dirs_counter);
-
- if (sbi->s_log_groups_per_flex) {
- ext4_group_t f;
-
- f = ext4_flex_group(sbi, block_group);
- atomic_inc(&sbi->s_flex_groups[f].free_inodes);
- }
- }
- BUFFER_TRACE(bh2, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, NULL, bh2);
- if (!fatal) fatal = err;
+ count = ext4_free_inodes_count(sb, gdp) + 1;
+ ext4_free_inodes_set(sb, gdp, count);
+ if (is_directory) {
+ count = ext4_used_dirs_count(sb, gdp) - 1;
+ ext4_used_dirs_set(sb, gdp, count);
+ percpu_counter_dec(&sbi->s_dirs_counter);
}
- BUFFER_TRACE(bitmap_bh, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
- if (!fatal)
- fatal = err;
- sb->s_dirt = 1;
+ gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
+ ext4_unlock_group(sb, block_group);
+
+ percpu_counter_inc(&sbi->s_freeinodes_counter);
+ if (sbi->s_log_groups_per_flex) {
+ ext4_group_t f = ext4_flex_group(sbi, block_group);
+
+ atomic_inc(&sbi->s_flex_groups[f].free_inodes);
+ if (is_directory)
+ atomic_dec(&sbi->s_flex_groups[f].used_dirs);
+ }
+ BUFFER_TRACE(bh2, "call ext4_handle_dirty_metadata");
+ fatal = ext4_handle_dirty_metadata(handle, NULL, bh2);
+out:
+ if (cleared) {
+ BUFFER_TRACE(bitmap_bh, "call ext4_handle_dirty_metadata");
+ err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
+ if (!fatal)
+ fatal = err;
+ sb->s_dirt = 1;
+ } else
+ ext4_error(sb, "bit already cleared for inode %lu", ino);
+
error_return:
brelse(bitmap_bh);
ext4_std_error(sb, fatal);
* Re: ext34_free_inode's mess
2010-04-15 22:01 ` Dmitry Monakhov
@ 2010-04-16 13:33 ` tytso
0 siblings, 0 replies; 16+ messages in thread
From: tytso @ 2010-04-16 13:33 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: Jan Kara, ext4 development
On Fri, Apr 16, 2010 at 02:01:35AM +0400, Dmitry Monakhov wrote:
> Ok, if we know that any error result in EIO or panic when let's just
> call it style cleanup(simplification), imho new code is more readable.
Agreed. The reason you're seeing me respin this patch a few times is
because we recently added some additional qualification testing for
ext4 in $DAYJOB, and we've found that running dbench followed by fsck
-fy also seems to be a good way of tickling this bug --- and applying
the patch which you wrote does seem to make it go away.
Like you, I can't reproduce the problem once the patch has been
applied; and like you and Jan, I can't see how this patch would
actually fix a race or some other bug. But given that (a) it
definitely is a code cleanup, and (b) it empirically seems to make the
bug go away, and (c) we've seen this problem in our production
servers, I'm inclined to take it.
I hope to spend a bit more time in the next few days trying to figure
out what the actual root cause is, so we can figure out whether this
is really fixing a problem, or just making it harder to hit.
Dmitry, I need to thank you for all of the ext4 testing and bug fixing
you've been doing. I really appreciate it!!! I'm pretty sure BTW
that BZ #15792 is also one that we've seen on our production servers,
and so you're finding issues that aren't just showing up in
regression/stress test suites, but can and actually do happen in
real-world settings.
- Ted
* Re: [PATCH 2/2] ext4: fix inode bitmaps manipulation in free_inode
2010-04-16 1:06 ` tytso
@ 2010-04-17 10:57 ` Dmitry Monakhov
0 siblings, 0 replies; 16+ messages in thread
From: Dmitry Monakhov @ 2010-04-17 10:57 UTC (permalink / raw)
To: tytso; +Cc: linux-ext4, jack
tytso@mit.edu writes:
> Here's my -V3 respin of this patch, which further cleans up the code
> and removes some duplicated code by only calling ext4_clear_bit() from
> one call site.
>
> I think I'm about done for this, so if you agree with my improvements
> as improvements, it might be useful to port this back to ext3 version
> of this patch.
OK, agreed, that looks better.
>
> - Ted
>
> ext4: clean up inode bitmaps manipulation in ext4_free_inode
>
> From: Dmitry Monakhov <dmonakhov@openvz.org>
>
> - Reorganize locking scheme to batch two atomic operation in to one.
> This also allow us to state what healthy group must obey following rule
> ext4_free_inodes_count(sb, gdp) == ext4_count_free(inode_bitmap, NUM);
> - Fix possible undefined pointer dereference.
> - Even if group descriptor stats aren't accessible we have to update
> inode bitmaps.
> - Move non-group members update out of group_lock.
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
> fs/ext4/ialloc.c | 81 ++++++++++++++++++++++++-----------------------------
> 1 files changed, 37 insertions(+), 44 deletions(-)
>
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 57f6eef..52618d5 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -240,56 +240,49 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
> if (fatal)
> goto error_return;
>
> - /* Ok, now we can actually update the inode bitmaps.. */
> - cleared = ext4_clear_bit_atomic(ext4_group_lock_ptr(sb, block_group),
> - bit, bitmap_bh->b_data);
> - if (!cleared)
> - ext4_error(sb, "bit already cleared for inode %lu", ino);
> - else {
> - gdp = ext4_get_group_desc(sb, block_group, &bh2);
> -
> + fatal = -ESRCH;
> + gdp = ext4_get_group_desc(sb, block_group, &bh2);
> + if (gdp) {
> BUFFER_TRACE(bh2, "get_write_access");
> fatal = ext4_journal_get_write_access(handle, bh2);
> - if (fatal) goto error_return;
> -
> - if (gdp) {
> - ext4_lock_group(sb, block_group);
> - count = ext4_free_inodes_count(sb, gdp) + 1;
> - ext4_free_inodes_set(sb, gdp, count);
> - if (is_directory) {
> - count = ext4_used_dirs_count(sb, gdp) - 1;
> - ext4_used_dirs_set(sb, gdp, count);
> - if (sbi->s_log_groups_per_flex) {
> - ext4_group_t f;
> -
> - f = ext4_flex_group(sbi, block_group);
> - atomic_dec(&sbi->s_flex_groups[f].used_dirs);
> - }
> + }
> + ext4_lock_group(sb, block_group);
> + cleared = ext4_clear_bit(bit, bitmap_bh->b_data);
> + if (fatal || !cleared) {
> + ext4_unlock_group(sb, block_group);
> + goto out;
> + }
>
> - }
> - gdp->bg_checksum = ext4_group_desc_csum(sbi,
> - block_group, gdp);
> - ext4_unlock_group(sb, block_group);
> - percpu_counter_inc(&sbi->s_freeinodes_counter);
> - if (is_directory)
> - percpu_counter_dec(&sbi->s_dirs_counter);
> -
> - if (sbi->s_log_groups_per_flex) {
> - ext4_group_t f;
> -
> - f = ext4_flex_group(sbi, block_group);
> - atomic_inc(&sbi->s_flex_groups[f].free_inodes);
> - }
> - }
> - BUFFER_TRACE(bh2, "call ext4_handle_dirty_metadata");
> - err = ext4_handle_dirty_metadata(handle, NULL, bh2);
> - if (!fatal) fatal = err;
> + count = ext4_free_inodes_count(sb, gdp) + 1;
> + ext4_free_inodes_set(sb, gdp, count);
> + if (is_directory) {
> + count = ext4_used_dirs_count(sb, gdp) - 1;
> + ext4_used_dirs_set(sb, gdp, count);
> + percpu_counter_dec(&sbi->s_dirs_counter);
> }
> - BUFFER_TRACE(bitmap_bh, "call ext4_handle_dirty_metadata");
> - err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
> - if (!fatal)
> - fatal = err;
> - sb->s_dirt = 1;
> + gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
> + ext4_unlock_group(sb, block_group);
> +
> + percpu_counter_inc(&sbi->s_freeinodes_counter);
> + if (sbi->s_log_groups_per_flex) {
> + ext4_group_t f = ext4_flex_group(sbi, block_group);
> +
> + atomic_inc(&sbi->s_flex_groups[f].free_inodes);
> + if (is_directory)
> + atomic_dec(&sbi->s_flex_groups[f].used_dirs);
> + }
> + BUFFER_TRACE(bh2, "call ext4_handle_dirty_metadata");
> + fatal = ext4_handle_dirty_metadata(handle, NULL, bh2);
> +out:
> + if (cleared) {
> + BUFFER_TRACE(bitmap_bh, "call ext4_handle_dirty_metadata");
> + err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
> + if (!fatal)
> + fatal = err;
> + sb->s_dirt = 1;
> + } else
> + ext4_error(sb, "bit already cleared for inode %lu", ino);
> +
> error_return:
> brelse(bitmap_bh);
> ext4_std_error(sb, fatal);
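
For readers following the locking change, here is a minimal user-space sketch of the rule stated in the commit message: when the bitmap bit and the free-inode counter are updated inside one critical section, the counter should always equal the number of clear bits. This is not kernel code. The names group_model, group_init, count_free, model_alloc_inode, model_free_inode and INODES_PER_GROUP are hypothetical, a pthread mutex stands in for ext4_lock_group(), and the journal error paths of the real function are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define INODES_PER_GROUP 64 /* hypothetical size, for illustration only */

/* Hypothetical stand-in for one block group: bitmap plus descriptor count. */
struct group_model {
    pthread_mutex_t lock;                       /* plays the role of the group lock */
    unsigned char bitmap[INODES_PER_GROUP / 8]; /* bit set = inode in use */
    unsigned free_inodes;                       /* descriptor's free-inode count */
};

static void group_init(struct group_model *g)
{
    pthread_mutex_init(&g->lock, NULL);
    for (size_t i = 0; i < sizeof(g->bitmap); i++)
        g->bitmap[i] = 0;
    g->free_inodes = INODES_PER_GROUP;
}

/* Count clear bits; call only while no other thread is modifying the group. */
static unsigned count_free(const struct group_model *g)
{
    unsigned n = 0;

    for (unsigned bit = 0; bit < INODES_PER_GROUP; bit++)
        if (!(g->bitmap[bit / 8] & (1u << (bit % 8))))
            n++;
    return n;
}

/* Mark an inode as used; returns false if it was already in use. */
static bool model_alloc_inode(struct group_model *g, unsigned bit)
{
    bool was_free;

    assert(bit < INODES_PER_GROUP);
    pthread_mutex_lock(&g->lock);
    was_free = (g->bitmap[bit / 8] & (1u << (bit % 8))) == 0;
    if (was_free) {
        g->bitmap[bit / 8] |= (unsigned char)(1u << (bit % 8));
        g->free_inodes--;
    }
    pthread_mutex_unlock(&g->lock);
    return was_free;
}

/*
 * Free an inode. The bit and the counter change under the same lock,
 * so no other thread can observe one updated without the other.
 * Returns false if the bit was already clear (the "double free" case).
 */
static bool model_free_inode(struct group_model *g, unsigned bit)
{
    bool cleared;

    assert(bit < INODES_PER_GROUP);
    pthread_mutex_lock(&g->lock);
    cleared = (g->bitmap[bit / 8] & (1u << (bit % 8))) != 0;
    if (cleared) {
        g->bitmap[bit / 8] &= (unsigned char)~(1u << (bit % 8));
        g->free_inodes++;
    }
    pthread_mutex_unlock(&g->lock);
    return cleared;
}
```

In this model, freeing the same bit twice leaves the counter untouched on the second call, so free_inodes == count_free() holds afterwards. That is the property fsck's pass 5 was reporting as violated.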
end of thread, other threads:[~2010-04-17 10:57 UTC | newest]
Thread overview: 16+ messages
2010-04-14 11:19 ext34_free_inode's mess Dmitry Monakhov
2010-04-14 11:23 ` [PATCH 1/2] ext3: fix inode bitmaps manipulation in free_inode Dmitry Monakhov
2010-04-14 11:23 ` [PATCH 2/2] ext4: " Dmitry Monakhov
2010-04-15 0:12 ` tytso
2010-04-16 1:06 ` tytso
2010-04-17 10:57 ` Dmitry Monakhov
2010-04-14 11:35 ` ext34_free_inode's mess Dmitry Monakhov
2010-04-14 13:34 ` Jan Kara
2010-04-14 14:33 ` Dmitry Monakhov
2010-04-15 21:39 ` Jan Kara
2010-04-15 22:01 ` Dmitry Monakhov
2010-04-16 13:33 ` tytso
2010-04-14 16:03 ` Eric Sandeen
2010-04-14 16:01 ` Eric Sandeen
2010-04-14 16:56 ` Dmitry Monakhov
2010-04-14 23:47 ` Dave Chinner