* [Bug 45741] New: ext4 scans all disk when calling fallocate after mount on 99% full volume.
@ 2012-08-08 16:42 bugzilla-daemon
2012-08-09 18:10 ` [Bug 45741] " bugzilla-daemon
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: bugzilla-daemon @ 2012-08-08 16:42 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=45741
Summary: ext4 scans all disk when calling fallocate after mount
on 99% full volume.
Product: File System
Version: 2.5
Kernel Version: 3.2.0-23-generic
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: high
Priority: P1
Component: ext4
AssignedTo: fs_ext4@kernel-bugs.osdl.org
ReportedBy: mirek@me.com
Regression: No
Created an attachment (id=77131)
--> (https://bugzilla.kernel.org/attachment.cgi?id=77131)
block io graph
It seems I can reproduce this problem every time.
After filling a 55TB ext4 volume to 99% full (using only fallocated files of
0-50MB, with 10% of them deleted along the way to fragment the free space
further), I've run into a problem where the whole system freezes for ~5
minutes. To reproduce:
1) unmount filesystem
2) mount filesystem
3) fallocate a file
It seems that the system freezes for about 5 minutes every time.
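The third step can be timed with a small helper along these lines. This is only a sketch: it assumes the file is preallocated with posix_fallocate(), which may not be what the reporter used, and the function name and path are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Preallocate `len` bytes for `path` and return the elapsed wall-clock
 * time in seconds, or -1.0 on failure.  Hypothetical helper, not part
 * of the original report. */
double time_fallocate(const char *path, off_t len)
{
    struct timespec t0, t1;
    int fd, err;

    assert(len > 0);
    fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    err = posix_fallocate(fd, 0, len);  /* returns 0 or an errno value */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (err != 0)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling it once right after a fresh mount and once more afterwards should show whether the delay is limited to the first allocation.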
Initially I thought the disk was doing nothing, but in fact the OS seems to
scan the whole disk before continuing (graph attached). It looks like it's
reading every single inode before proceeding with fallocate?
Kernel logs the same thing every time:
Aug 8 17:05:09 XXX kernel: [189400.847170] INFO: task jbd2/sdc1-8:18852 blocked for more than 120 seconds.
Aug 8 17:05:09 XXX kernel: [189400.847561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 8 17:05:09 XXX kernel: [189400.868909] jbd2/sdc1-8 D ffffffff81806240 0 18852 2 0x00000000
Aug 8 17:05:09 XXX kernel: [189400.868915] ffff8801a1e33ce0 0000000000000046 ffff8801a1e33c80 ffffffff811a86ce
Aug 8 17:05:09 XXX kernel: [189400.868920] ffff8801a1e33fd8 ffff8801a1e33fd8 ffff8801a1e33fd8 0000000000013780
Aug 8 17:05:09 XXX kernel: [189400.868925] ffffffff81c0d020 ffff8802320ec4d0 ffff8801a1e33cf0 ffff8801a1e33df8
Aug 8 17:05:09 XXX kernel: [189400.868929] Call Trace:
Aug 8 17:05:09 XXX kernel: [189400.868940] [<ffffffff811a86ce>] ? __wait_on_buffer+0x2e/0x30
Aug 8 17:05:09 XXX kernel: [189400.868947] [<ffffffff8165a55f>] schedule+0x3f/0x60
Aug 8 17:05:09 XXX kernel: [189400.868955] [<ffffffff8126052a>] jbd2_journal_commit_transaction+0x18a/0x1240
Aug 8 17:05:09 XXX kernel: [189400.868962] [<ffffffff8165c6fe>] ? _raw_spin_lock_irqsave+0x2e/0x40
Aug 8 17:05:09 XXX kernel: [189400.868970] [<ffffffff81077198>] ? lock_timer_base.isra.29+0x38/0x70
Aug 8 17:05:09 XXX kernel: [189400.868976] [<ffffffff8108aec0>] ? add_wait_queue+0x60/0x60
Aug 8 17:05:09 XXX kernel: [189400.868982] [<ffffffff812652ab>] kjournald2+0xbb/0x220
Aug 8 17:05:09 XXX kernel: [189400.868988] [<ffffffff8108aec0>] ? add_wait_queue+0x60/0x60
Aug 8 17:05:09 XXX kernel: [189400.868993] [<ffffffff812651f0>] ? commit_timeout+0x10/0x10
Aug 8 17:05:09 XXX kernel: [189400.868999] [<ffffffff8108a42c>] kthread+0x8c/0xa0
Aug 8 17:05:09 XXX kernel: [189400.869005] [<ffffffff81666bf4>] kernel_thread_helper+0x4/0x10
Aug 8 17:05:09 XXX kernel: [189400.869011] [<ffffffff8108a3a0>] ? flush_kthread_worker+0xa0/0xa0
Aug 8 17:05:09 XXX kernel: [189400.869016] [<ffffffff81666bf0>] ? gs_change+0x13/0x13
Is this normal?
--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
* [Bug 45741] ext4 scans all disk when calling fallocate after mount on 99% full volume.
2012-08-08 16:42 [Bug 45741] New: ext4 scans all disk when calling fallocate after mount on 99% full volume bugzilla-daemon
@ 2012-08-09 18:10 ` bugzilla-daemon
2012-08-10 18:21 ` [PATCH] ext4: don't load the block bitmap for block groups which have no space Theodore Ts'o
2012-10-15 21:24 ` [Bug 45741] ext4 scans all disk when calling fallocate after mount on 99% full volume bugzilla-daemon
2012-11-08 14:21 ` bugzilla-daemon
2 siblings, 1 reply; 9+ messages in thread
From: bugzilla-daemon @ 2012-08-09 18:10 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=45741
Theodore Tso <tytso@mit.edu> changed:
What      |Removed   |Added
----------------------------------------------------------------------------
CC        |          |tytso@mit.edu
--- Comment #1 from Theodore Tso <tytso@mit.edu> 2012-08-09 18:10:59 ---
It's not scanning every single inode (that would take a lot longer!), but it is
scanning every single block allocation bitmap. The problem is that we know
how many free blocks are in a block group, but we don't know the distribution
of the free blocks. The distribution (there are X blocks of size 2**3, Y blocks of
size 2**4, etc.) is cached in memory, but the first time you unmount and mount
the file system, we need to read in the block bitmap for a block group.
Normally, we only do this until we find a suitable group, but when the file
system is completely full, we might need to scan the entire disk.
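To illustrate why the free-block count alone is not enough, here is a simplified userspace sketch that derives a size distribution from a bitmap. It is not the mballoc buddy code: it buckets each maximal free run by the largest power of two that fits in it, and the names, MAX_ORDER, and the LSB-first bit order are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ORDER 14   /* illustrative upper bound, not taken from ext4 */

/* Bit i of `bitmap` set means block i is in use (LSB-first per byte). */
static int block_in_use(const unsigned char *bitmap, size_t i)
{
    return (bitmap[i / 8] >> (i % 8)) & 1;
}

/* Walk the bitmap and, for every maximal run of free blocks, bump
 * counts[k], where 2**k is the largest power of two not exceeding the
 * run length.  Returns the total number of free blocks. */
size_t free_run_histogram(const unsigned char *bitmap, size_t nblocks,
                          size_t counts[MAX_ORDER + 1])
{
    size_t total = 0, run = 0;

    assert(bitmap != NULL);
    memset(counts, 0, (MAX_ORDER + 1) * sizeof(counts[0]));
    for (size_t i = 0; i <= nblocks; i++) {
        if (i < nblocks && !block_in_use(bitmap, i)) {
            run++;
            continue;
        }
        if (run > 0) {      /* a free run just ended: classify it */
            int order = 0;
            while (order < MAX_ORDER && ((size_t)2 << order) <= run)
                order++;
            counts[order]++;
            total += run;
            run = 0;
        }
    }
    return total;
}
```

Two groups can report the same number of free blocks and still produce very different histograms, and telling them apart requires the bitmap itself.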
I've looked at mballoc, and there are some things we can fix on our side.
We're reading in the block bitmap without first checking to see if the block
group is completely filled. So that's an easy fix on our side, which will help
at least somewhat. So thanks for reporting this.
That being said, it's a really bad idea to try to use a file system to 99%.
Above 80%, the file system performance definitely starts to fall off, and by
the time you get up to 95%, performance is going to be really awful. There are
definitely things we can do to improve things, but ultimately, it's something
that you should plan for.
You could also try increasing the flex-bg size, which is a configuration knob
when the file system is formatted. This collects allocation bitmaps for
adjacent block groups together. The default is 16, but you could try bumping
that up to 64 or even 128. It will improve the time needed to scan all of the
allocation bitmaps in the cold cache case, but it may also decrease performance
after that, when you need to allocate and deallocate inodes and blocks, and by
increasing the distance from data blocks to the inode table. How much this
tradeoff will work is going to be very dependent on the details of your
workload.
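For a sense of scale, a rough sketch of the arithmetic follows. It assumes 4 KiB blocks and reads "55TB" as 55 TiB, neither of which is stated in the report, and the function names are made up. A block group's bitmap occupies one block and therefore maps block_size * 8 blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Number of block groups on a volume, given that one bitmap block
 * maps block_size * 8 blocks. */
uint64_t block_groups(uint64_t volume_bytes, uint32_t block_size)
{
    uint64_t blocks_per_group, group_bytes;

    assert(block_size > 0);
    blocks_per_group = (uint64_t)block_size * 8;
    group_bytes = blocks_per_group * block_size;
    return (volume_bytes + group_bytes - 1) / group_bytes;
}

/* Number of separate on-disk regions holding block bitmaps when
 * flex_bg packs the bitmaps of `flex_size` adjacent groups together. */
uint64_t bitmap_regions(uint64_t groups, uint32_t flex_size)
{
    assert(flex_size > 0);
    return (groups + flex_size - 1) / flex_size;
}
```

Under those assumptions there are 450560 block groups, so a cold scan of every bitmap touches 28160 regions at the default flex-bg size of 16, versus 7040 at 64 and 3520 at 128.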
* [PATCH] ext4: don't load the block bitmap for block groups which have no space
2012-08-09 18:10 ` [Bug 45741] " bugzilla-daemon
@ 2012-08-10 18:21 ` Theodore Ts'o
2012-08-13 16:02 ` Eric Sandeen
0 siblings, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2012-08-10 18:21 UTC (permalink / raw)
To: Ext4 Developers List; +Cc: Theodore Ts'o
Add a short circuit check to ext4_mb_good_group() so that we don't
bother to load the block bitmap for a block group which does not have
any space available. (Or which does not have enough space until we
are in desperation mode, i.e., when cr == 3.)
Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=45741
Reported-by: mirek@me.com
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
fs/ext4/mballoc.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8eae947..3a57975 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1862,6 +1862,12 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 
 	BUG_ON(cr < 0 || cr >= 4);
 
+	free = grp->bb_free;
+	if (free == 0)
+		return 0;
+	if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+		return 0;
+
 	/* We only do this if the grp has never been initialized */
 	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
 		int ret = ext4_mb_init_group(ac->ac_sb, group);
@@ -1869,10 +1875,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 			return 0;
 	}
 
-	free = grp->bb_free;
 	fragments = grp->bb_fragments;
-	if (free == 0)
-		return 0;
 	if (fragments == 0)
 		return 0;
 
--
1.7.12.rc0.22.gcdd159b
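The effect of moving the check can be modelled in userspace roughly as follows. This is a sketch, not kernel code: struct group_summary and both functions are hypothetical stand-ins, and only the two comparisons mirror the lines added by the patch.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-group summary kept in memory. */
struct group_summary {
    unsigned int free_blocks;   /* plays the role of grp->bb_free */
    int need_init;              /* nonzero: bitmap not yet read from disk */
};

/* Returns 1 if the group is worth a bitmap read for a request of
 * goal_len blocks at criterion cr, 0 if it can be skipped outright. */
int worth_loading(const struct group_summary *g, int cr,
                  unsigned int goal_len)
{
    assert(cr >= 0 && cr < 4);
    if (g->free_blocks == 0)
        return 0;
    if (cr <= 2 && g->free_blocks < goal_len)
        return 0;
    return 1;
}

/* Count how many bitmap reads a cold-cache scan would issue when the
 * check above runs before the bitmap is loaded. */
unsigned int count_bitmap_reads(const struct group_summary *groups,
                                unsigned int ngroups, int cr,
                                unsigned int goal_len)
{
    unsigned int reads = 0;

    for (unsigned int i = 0; i < ngroups; i++)
        if (groups[i].need_init &&
            worth_loading(&groups[i], cr, goal_len))
            reads++;
    return reads;
}
```

On a nearly full volume most groups fail the first comparison, so under this model most of the reads disappear.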
* Re: [PATCH] ext4: don't load the block bitmap for block groups which have no space
2012-08-10 18:21 ` [PATCH] ext4: don't load the block bitmap for block groups which have no space Theodore Ts'o
@ 2012-08-13 16:02 ` Eric Sandeen
2012-08-13 18:49 ` Theodore Ts'o
0 siblings, 1 reply; 9+ messages in thread
From: Eric Sandeen @ 2012-08-13 16:02 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Ext4 Developers List
On 8/10/12 1:21 PM, Theodore Ts'o wrote:
> Add a short circuit check to ext4_mb_good_group() so that we don't
> bother to load the block bitmap for a block group which does not have
> any space available. (Or which does not have enough space until we
> are in desperation mode, i.e., when cr == 3.)
>
> Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=45741
> Reported-by: mirek@me.com
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Looks ok to me; I think this just further optimizes what was done
in
8a57d9d61a6e361c7bb159dda797672c1df1a691
ext4: check for a good block group before loading buddy pages
correct?
-Eric
> ---
> fs/ext4/mballoc.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8eae947..3a57975 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1862,6 +1862,12 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
>
> BUG_ON(cr < 0 || cr >= 4);
>
> + free = grp->bb_free;
> + if (free == 0)
> + return 0;
> + if (cr <= 2 && free < ac->ac_g_ex.fe_len)
> + return 0;
> +
> /* We only do this if the grp has never been initialized */
> if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
> int ret = ext4_mb_init_group(ac->ac_sb, group);
> @@ -1869,10 +1875,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
> return 0;
> }
>
> - free = grp->bb_free;
> fragments = grp->bb_fragments;
> - if (free == 0)
> - return 0;
> if (fragments == 0)
> return 0;
>
>
* Re: [PATCH] ext4: don't load the block bitmap for block groups which have no space
2012-08-13 16:02 ` Eric Sandeen
@ 2012-08-13 18:49 ` Theodore Ts'o
2012-08-13 18:51 ` Eric Sandeen
2012-08-13 23:20 ` Andreas Dilger
0 siblings, 2 replies; 9+ messages in thread
From: Theodore Ts'o @ 2012-08-13 18:49 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Ext4 Developers List
On Mon, Aug 13, 2012 at 11:02:08AM -0500, Eric Sandeen wrote:
>
> Looks ok to me; I think this just further optimizes what was done
> in
>
> 8a57d9d61a6e361c7bb159dda797672c1df1a691
> ext4: check for a good block group before loading buddy pages
>
> correct?
Yes, that's right; it's a further optimization.
I can think of an additional optimization where if we are reading the
block bitmap for block group N, and the block bitmap for block group
N+1 hasn't been read before (so we don't have buddy bitmap stats), and
the block bitmap for bg N+1 is adjacent to the one for bg N, we should read both
at the same time. (And this could be generalized for N+2, N+3, etc.)
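The batching idea can be sketched as a small helper that decides how many bitmaps to fetch in one go. Nothing here comes from a posted patch: struct bg_info, its fields, and the max_batch limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-group record: where the block bitmap lives on disk
 * and whether its buddy statistics are already cached. */
struct bg_info {
    uint64_t bitmap_block;
    int      stats_cached;
};

/* Starting at group n, return how many consecutive groups could be
 * served by one contiguous read: each following group must have
 * uncached stats and a bitmap in the block right after its
 * predecessor's. */
size_t adjacent_batch(const struct bg_info *bg, size_t ngroups, size_t n,
                      size_t max_batch)
{
    size_t len = 1;

    assert(n < ngroups && max_batch >= 1);
    while (len < max_batch && n + len < ngroups &&
           !bg[n + len].stats_cached &&
           bg[n + len].bitmap_block == bg[n + len - 1].bitmap_block + 1)
        len++;
    return len;
}
```

A return value of 1 means only group N's own bitmap is read, which matches the current behaviour.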
I'm not entirely sure whether it's worth the effort, but I suspect for
very full file systems, it might very well be. This is a more
general case of the problem where most people only benchmark mostly
empty file systems, and my experience has been that above 70-80%
utilization, our performance starts to fall off. And while disk space
is cheap, it's not _that_ cheap, and there are always customers who
insist on using file systems up to a utilization of 99%, and expect
the same performance as when the file system was freshly formatted. :-(
- Ted
* Re: [PATCH] ext4: don't load the block bitmap for block groups which have no space
2012-08-13 18:49 ` Theodore Ts'o
@ 2012-08-13 18:51 ` Eric Sandeen
2012-08-13 23:20 ` Andreas Dilger
1 sibling, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2012-08-13 18:51 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Ext4 Developers List
On 8/13/12 1:49 PM, Theodore Ts'o wrote:
> On Mon, Aug 13, 2012 at 11:02:08AM -0500, Eric Sandeen wrote:
>>
>> Looks ok to me; I think this just further optimizes what was done
>> in
>>
>> 8a57d9d61a6e361c7bb159dda797672c1df1a691
>> ext4: check for a good block group before loading buddy pages
>>
>> correct?
>
> Yes, that's right; it's a further optimization.
>
> I can think of an additional optimization where if we are reading the
> block bitmap for block group N, and the block bitmap for block group
> N+1 hasn't been read before (so we don't have buddy bitmap stats), and
> the block bitmap for bg N+1 is adjacent to the one for bg N, we should read both
> at the same time. (And this could be generalized for N+2, N+3, etc.)
>
> I'm not entirely sure whether it's worth the effort, but I suspect for
> very full file systems, it might very well be. This is a more
> general case of the problem where most people only benchmark mostly
> empty file systems, and my experience has been that above 70-80%
> utilization, our performance starts to fall off. And while disk space
> is cheap, it's not _that_ cheap, and there are always customers who
> insist on using file systems up to a utilization of 99%, and expect
> the same performance as when the file system was freshly formatted. :-(
I did some tests w/ very large filesystems, fallocating 1T at a time until
full. ext4 tended to fall down pretty badly towards the end. Anything that
can reduce the time it takes to find free blocks as a very large filesystem
fills would probably be useful....
-eric
> - Ted
>
* Re: [PATCH] ext4: don't load the block bitmap for block groups which have no space
2012-08-13 18:49 ` Theodore Ts'o
2012-08-13 18:51 ` Eric Sandeen
@ 2012-08-13 23:20 ` Andreas Dilger
1 sibling, 0 replies; 9+ messages in thread
From: Andreas Dilger @ 2012-08-13 23:20 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: Eric Sandeen, Ext4 Developers List
On 2012-08-13, at 12:49 PM, Theodore Ts'o wrote:
> On Mon, Aug 13, 2012 at 11:02:08AM -0500, Eric Sandeen wrote:
>>
>> Looks ok to me; I think this just further optimizes what was done
>> in
>>
>> 8a57d9d61a6e361c7bb159dda797672c1df1a691
>> ext4: check for a good block group before loading buddy pages
>>
>> correct?
>
> Yes, that's right; it's a further optimization.
>
> I can think of an additional optimization where if we are reading the
> block bitmap for block group N, and the block bitmap for block group
> N+1 hasn't been read before (so we don't have buddy bitmap stats), and
> the block bitmap for bg N+1 is adjacent to the one for bg N, we should read both
> at the same time. (And this could be generalized for N+2, N+3, etc.)
I was thinking the same thing. Seems a shame that we have contiguous
bitmaps with flex_bg and don't load them all at once. However, I ended
up deciding not to pursue the issue, because I suspect the block device
will already be doing some physical block/track readahead. I guess it
couldn't hurt to submit explicit readahead requests, so long as we don't
wait for anything but the first bitmap to actually be loaded.
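A userspace analogue of "hint the rest, wait only for the first" might look like the sketch below. It uses posix_fadvise() with POSIX_FADV_WILLNEED on a file descriptor, which is not the in-kernel mechanism under discussion, and the function name and chunk layout are assumptions for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the chunk at index `first` from fd, after hinting that the next
 * `ahead` chunks will be needed soon.  Only the first read is waited
 * for.  Returns the byte count from pread(), or -1 on error. */
ssize_t read_first_hint_rest(int fd, off_t first, unsigned int ahead,
                             size_t chunk, void *buf)
{
    off_t start;

    assert(chunk > 0);
    start = first * (off_t)chunk;
    if (ahead > 0) {
        /* Advisory only: a failure here is not treated as an error. */
        (void)posix_fadvise(fd, start + (off_t)chunk,
                            (off_t)chunk * ahead, POSIX_FADV_WILLNEED);
    }
    return pread(fd, buf, chunk, start);
}
```

Because the hint is advisory, the caller never blocks on the later chunks, and if the device already reads ahead on its own the extra request should cost little.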
> I'm not entirely sure whether it's worth the effort, but I suspect for
> very full file systems, it might very well be. This is a more
> general case of the problem where most people only benchmark mostly
> empty file systems, and my experience has been that above 70-80%
> utilization, our performance starts to fall off. And while disk space
> is cheap, it's not _that_ cheap, and there are always customers who
> insist on using file systems up to a utilization of 99%, and expect
> the same performance as when the file system was freshly formatted. :-(
In my experience, there are so many factors that affect the performance
of a full filesystem that nothing can be done about it.
We've discussed changing statfs() reporting for Lustre to exclude the
"reserved" amount from the device size, so that people don't complain
"why can't I use the last 5% of the device" and/or "tune2fs -m 0" to
remove the reserved space, then complain when performance permanently
dives after hitting 100% full due to bad fragmentation of the last 5%
of files written that will not be deleted for many months. Even with
SSDs, the fragmentation is going to be seen, due to erase block
fragmentation and more IO submission overhead for small chunks.
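For reference, the reserve is visible from userspace as the gap between f_bfree and f_bavail in statvfs(). The sketch below estimates it as a fraction of the device; the function names are made up, and the estimate is only meaningful while free space still exceeds the reserve (once f_bavail reaches zero, the gap shrinks with f_bfree).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/* Fraction of the file system's blocks that are free but not available
 * to unprivileged users, i.e. an estimate of the reserve. */
double reserved_fraction(const struct statvfs *st)
{
    if (st->f_blocks == 0 || st->f_bfree < st->f_bavail)
        return 0.0;
    return (double)(st->f_bfree - st->f_bavail) / (double)st->f_blocks;
}

/* Store the estimate for the file system holding `path` in *out.
 * Returns 0 on success, -1 if statvfs() fails. */
int reserved_fraction_of(const char *path, double *out)
{
    struct statvfs st;

    assert(out != NULL);
    if (statvfs(path, &st) != 0)
        return -1;
    *out = reserved_fraction(&st);
    return 0;
}
```

A volume that had its reserve removed with "tune2fs -m 0" would be expected to report a value at or near zero here.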
The other significant factor is the inner/outer track performance can
vary by a factor of 2x on some drives. The ext4 allocator biases toward
outer tracks, which is good, but performance is down on the inner tracks
regardless of whether there is fragmentation or not.
Cheers, Andreas
* [Bug 45741] ext4 scans all disk when calling fallocate after mount on 99% full volume.
2012-08-08 16:42 [Bug 45741] New: ext4 scans all disk when calling fallocate after mount on 99% full volume bugzilla-daemon
2012-08-09 18:10 ` [Bug 45741] " bugzilla-daemon
@ 2012-10-15 21:24 ` bugzilla-daemon
2012-11-08 14:21 ` bugzilla-daemon
2 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2012-10-15 21:24 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=45741
Florian Mickler <florian@mickler.org> changed:
What      |Removed   |Added
----------------------------------------------------------------------------
CC        |          |florian@mickler.org
--- Comment #2 from Florian Mickler <florian@mickler.org> 2012-10-15 21:24:57 ---
A patch referencing this bug report has been merged in Linux v3.7-rc1:
commit 01fc48e8929e45e67527200017cff4e74e4ba054
Author: Theodore Ts'o <tytso@mit.edu>
Date: Fri Aug 17 09:46:17 2012 -0400
ext4: don't load the block bitmap for block groups which have no space
* [Bug 45741] ext4 scans all disk when calling fallocate after mount on 99% full volume.
2012-08-08 16:42 [Bug 45741] New: ext4 scans all disk when calling fallocate after mount on 99% full volume bugzilla-daemon
2012-08-09 18:10 ` [Bug 45741] " bugzilla-daemon
2012-10-15 21:24 ` [Bug 45741] ext4 scans all disk when calling fallocate after mount on 99% full volume bugzilla-daemon
@ 2012-11-08 14:21 ` bugzilla-daemon
2 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2012-11-08 14:21 UTC (permalink / raw)
To: linux-ext4
https://bugzilla.kernel.org/show_bug.cgi?id=45741
Alan <alan@lxorguk.ukuu.org.uk> changed:
What      |Removed   |Added
----------------------------------------------------------------------------
Status    |NEW       |RESOLVED
CC        |          |alan@lxorguk.ukuu.org.uk
Resolution|          |CODE_FIX