* [Bug 45741] New: ext4 scans all disk when calling fallocate after mount on 99% full volume.
From: bugzilla-daemon @ 2012-08-08 16:42 UTC
To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=45741

           Summary: ext4 scans all disk when calling fallocate after mount
                    on 99% full volume.
           Product: File System
           Version: 2.5
    Kernel Version: 3.2.0-23-generic
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: ext4
        AssignedTo: fs_ext4@kernel-bugs.osdl.org
        ReportedBy: mirek@me.com
        Regression: No

Created an attachment (id=77131)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=77131)
block io graph

It seems I can reproduce this problem every time. After filling up a 55TB
ext4 volume to 99% full (using only fallocated files of 0-50MB; 10% of them
were deleted along the way to fragment the free space further), I've run
into a problem where the whole system freezes for ~5 minutes. To reproduce:

1) unmount filesystem
2) mount filesystem
3) fallocate a file

It seems that every time the system freezes for about 5 minutes. Initially I
thought the disk was doing nothing, but in fact the OS seems to scan the
whole disk before continuing (graph attached) - it looks like it's reading
every single inode before proceeding with fallocate?

Kernel logs the same thing every time:

Aug 8 17:05:09 XXX kernel: [189400.847170] INFO: task jbd2/sdc1-8:18852 blocked for more than 120 seconds.
Aug 8 17:05:09 XXX kernel: [189400.847561] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 8 17:05:09 XXX kernel: [189400.868909] jbd2/sdc1-8 D ffffffff81806240 0 18852 2 0x00000000
Aug 8 17:05:09 XXX kernel: [189400.868915] ffff8801a1e33ce0 0000000000000046 ffff8801a1e33c80 ffffffff811a86ce
Aug 8 17:05:09 XXX kernel: [189400.868920] ffff8801a1e33fd8 ffff8801a1e33fd8 ffff8801a1e33fd8 0000000000013780
Aug 8 17:05:09 XXX kernel: [189400.868925] ffffffff81c0d020 ffff8802320ec4d0 ffff8801a1e33cf0 ffff8801a1e33df8
Aug 8 17:05:09 XXX kernel: [189400.868929] Call Trace:
Aug 8 17:05:09 XXX kernel: [189400.868940] [<ffffffff811a86ce>] ? __wait_on_buffer+0x2e/0x30
Aug 8 17:05:09 XXX kernel: [189400.868947] [<ffffffff8165a55f>] schedule+0x3f/0x60
Aug 8 17:05:09 XXX kernel: [189400.868955] [<ffffffff8126052a>] jbd2_journal_commit_transaction+0x18a/0x1240
Aug 8 17:05:09 XXX kernel: [189400.868962] [<ffffffff8165c6fe>] ? _raw_spin_lock_irqsave+0x2e/0x40
Aug 8 17:05:09 XXX kernel: [189400.868970] [<ffffffff81077198>] ? lock_timer_base.isra.29+0x38/0x70
Aug 8 17:05:09 XXX kernel: [189400.868976] [<ffffffff8108aec0>] ? add_wait_queue+0x60/0x60
Aug 8 17:05:09 XXX kernel: [189400.868982] [<ffffffff812652ab>] kjournald2+0xbb/0x220
Aug 8 17:05:09 XXX kernel: [189400.868988] [<ffffffff8108aec0>] ? add_wait_queue+0x60/0x60
Aug 8 17:05:09 XXX kernel: [189400.868993] [<ffffffff812651f0>] ? commit_timeout+0x10/0x10
Aug 8 17:05:09 XXX kernel: [189400.868999] [<ffffffff8108a42c>] kthread+0x8c/0xa0
Aug 8 17:05:09 XXX kernel: [189400.869005] [<ffffffff81666bf4>] kernel_thread_helper+0x4/0x10
Aug 8 17:05:09 XXX kernel: [189400.869011] [<ffffffff8108a3a0>] ? flush_kthread_worker+0xa0/0xa0
Aug 8 17:05:09 XXX kernel: [189400.869016] [<ffffffff81666bf0>] ? gs_change+0x13/0x13

Is this normal?

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.
* [Bug 45741] ext4 scans all disk when calling fallocate after mount on 99% full volume.
From: bugzilla-daemon @ 2012-08-09 18:10 UTC
To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=45741

Theodore Tso <tytso@mit.edu> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tytso@mit.edu

--- Comment #1 from Theodore Tso <tytso@mit.edu> 2012-08-09 18:10:59 ---
It's not scanning every single inode (that would take a lot longer!), but it
is scanning every single block allocation bitmap. The problem is that we
know how many free blocks are in a block group, but we don't know the
distribution of the free blocks. The distribution (there are X blocks of
size 2**3, Y blocks of size 2**4, etc.) is cached in memory, but after you
unmount and mount the file system that cache is cold, so we need to read in
the block bitmap for each block group we consider. Normally, we only do this
until we find a suitable group, but when the file system is completely full,
we might need to scan the entire disk.

I've looked at mballoc, and there are some things we can fix on our side.
We're reading in the block bitmap without first checking to see if the
block group is completely filled. So that's an easy fix on our side, which
will help at least somewhat. So thanks for reporting this.

That being said, it's a really bad idea to try to fill a file system to 99%.
Above 80%, the file system performance definitely starts to fall off, and by
the time you get up to 95%, performance is going to be really awful. There
are definitely things we can do to improve things, but ultimately, it's
something that you should plan for.

You could also try increasing the flex-bg size, which is a configuration
knob when the file system is formatted. This collects allocation bitmaps for
adjacent block groups together. The default is 16, but you could try bumping
that up to 64 or even 128. It will improve the time needed to scan all of
the allocation bitmaps in the cold cache case, but it may also decrease
performance after that, when you need to allocate and deallocate inodes and
blocks, and by increasing the distance from data blocks to the inode table.
How well this tradeoff works will depend heavily on the details of your
workload.
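To make the "distribution" in the comment above concrete, here is a minimal
user-space sketch in C of building that kind of histogram from a block
bitmap. It is not the mballoc code: the function name free_chunk_histogram,
the MAX_ORDER cap, and the bit layout (bit set = block in use) are
assumptions chosen for illustration. Each run of free blocks is split into
aligned power-of-two chunks, buddy-style, and counts[k] records how many
chunks of 2**k blocks were found.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ORDER 14 /* hypothetical cap on chunk size: 2**14 blocks */

/* Returns 1 if block i is marked in use in the bitmap, 0 if it is free. */
static int bit_in_use(const uint8_t *bitmap, size_t i)
{
	return (bitmap[i / 8] >> (i % 8)) & 1;
}

/*
 * Walk the bitmap once, split every run of free blocks into aligned
 * power-of-two chunks, and count the chunks per order.
 * Returns the total number of free blocks.
 */
size_t free_chunk_histogram(const uint8_t *bitmap, size_t nbits,
			    size_t counts[MAX_ORDER + 1])
{
	size_t total_free = 0;
	size_t i = 0;

	assert(bitmap != NULL && counts != NULL);

	for (size_t k = 0; k <= MAX_ORDER; k++)
		counts[k] = 0;

	while (i < nbits) {
		if (bit_in_use(bitmap, i)) {
			i++;
			continue;
		}

		/* Find the end of this run of free blocks. */
		size_t start = i;
		while (i < nbits && !bit_in_use(bitmap, i))
			i++;
		size_t len = i - start;
		total_free += len;

		/* Carve the run into the largest aligned chunks that fit. */
		while (len > 0) {
			unsigned order = 0;

			while (order < MAX_ORDER &&
			       start % ((size_t)2 << order) == 0 &&
			       ((size_t)2 << order) <= len)
				order++;

			counts[order]++;
			start += (size_t)1 << order;
			len -= (size_t)1 << order;
		}
	}
	return total_free;
}
```

The point of the sketch is that the histogram can only be built by reading
the bitmap itself; the per-group free count alone does not say whether the
free blocks form one large extent or many small ones.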
* [PATCH] ext4: don't load the block bitmap for block groups which have no space
From: Theodore Ts'o @ 2012-08-10 18:21 UTC
To: Ext4 Developers List; +Cc: Theodore Ts'o

Add a short circuit check to ext4_mb_good_group() so that we don't
bother to load the block bitmap for a block group which does not have
any space available. (Or which does not have enough space until we
are in desperation mode, i.e., when cr == 3.)

Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=45741
Reported-by: mirek@me.com
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/mballoc.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8eae947..3a57975 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1862,6 +1862,12 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 
 	BUG_ON(cr < 0 || cr >= 4);
 
+	free = grp->bb_free;
+	if (free == 0)
+		return 0;
+	if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+		return 0;
+
 	/* We only do this if the grp has never been initialized */
 	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
 		int ret = ext4_mb_init_group(ac->ac_sb, group);
@@ -1869,10 +1875,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 			return 0;
 	}
 
-	free = grp->bb_free;
 	fragments = grp->bb_fragments;
-	if (free == 0)
-		return 0;
 	if (fragments == 0)
 		return 0;
 
-- 
1.7.12.rc0.22.gcdd159b
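The effect of the reordering in this patch can be sketched outside the
kernel. The following is a simplified model, not the real
ext4_mb_good_group(): struct group_info, load_bitmap(), and
group_worth_loading() are hypothetical names, and the bitmap read is
replaced by a counter so that the saved I/O is visible. The checks that the
real function performs after initialization (fragments and so on) are left
out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-group state kept in memory. */
struct group_info {
	unsigned free;    /* free block count, known without reading the bitmap */
	bool initialized; /* false until the block bitmap has been read once */
};

/* Counts simulated bitmap reads, so the cost of each call can be observed. */
static unsigned bitmap_loads;

/* Stand-in for the expensive step: reading the bitmap from disk. */
static void load_bitmap(struct group_info *g)
{
	g->initialized = true;
	bitmap_loads++;
}

/*
 * Cheap checks first: reject the group from the free count alone, and only
 * pay for the bitmap read if the group could still satisfy the request.
 * cr is the allocator pass (0..3); goal_len is the requested length.
 */
bool group_worth_loading(struct group_info *g, int cr, unsigned goal_len)
{
	assert(cr >= 0 && cr < 4);

	if (g->free == 0)
		return false;
	/* Before the last-resort pass, skip groups that are too small. */
	if (cr <= 2 && g->free < goal_len)
		return false;

	if (!g->initialized)
		load_bitmap(g);
	return true;
}
```

In this model a completely full group, or a group with fewer free blocks
than the request while cr <= 2, is rejected without touching bitmap_loads.
Only the cr == 3 pass falls through to the read for an undersized group.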
* Re: [PATCH] ext4: don't load the block bitmap for block groups which have no space
From: Eric Sandeen @ 2012-08-13 16:02 UTC
To: Theodore Ts'o; +Cc: Ext4 Developers List

On 8/10/12 1:21 PM, Theodore Ts'o wrote:
> Add a short circuit check to ext4_mb_good_group() so that we don't
> bother to load the block bitmap for a block group which does not have
> any space available. (Or which does not have enough space until we
> are in desperation mode, i.e., when cr == 3.)
>
> Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=45741
> Reported-by: mirek@me.com
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Looks ok to me; I think this just further optimizes what was done
in

8a57d9d61a6e361c7bb159dda797672c1df1a691
ext4: check for a good block group before loading buddy pages

correct?

-Eric

> ---
>  fs/ext4/mballoc.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8eae947..3a57975 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1862,6 +1862,12 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
>
>  	BUG_ON(cr < 0 || cr >= 4);
>
> +	free = grp->bb_free;
> +	if (free == 0)
> +		return 0;
> +	if (cr <= 2 && free < ac->ac_g_ex.fe_len)
> +		return 0;
> +
>  	/* We only do this if the grp has never been initialized */
>  	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
>  		int ret = ext4_mb_init_group(ac->ac_sb, group);
> @@ -1869,10 +1875,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
>  			return 0;
>  	}
>
> -	free = grp->bb_free;
>  	fragments = grp->bb_fragments;
> -	if (free == 0)
> -		return 0;
>  	if (fragments == 0)
>  		return 0;
>
>
* Re: [PATCH] ext4: don't load the block bitmap for block groups which have no space
From: Theodore Ts'o @ 2012-08-13 18:49 UTC
To: Eric Sandeen; +Cc: Ext4 Developers List

On Mon, Aug 13, 2012 at 11:02:08AM -0500, Eric Sandeen wrote:
>
> Looks ok to me; I think this just further optimizes what was done
> in
>
> 8a57d9d61a6e361c7bb159dda797672c1df1a691
> ext4: check for a good block group before loading buddy pages
>
> correct?

Yes, that's right; it's a further optimization.

I can think of an additional optimization where if we are reading the
block bitmap for block group N, and the block bitmap for block group
N+1 hasn't been read before (so we don't have buddy bitmap stats), and
the block bitmap for bg N+1 is adjacent to that of bg N, we should read
both at the same time. (And this could be generalized for N+2, N+3, etc.)

I'm not entirely sure whether it's worth the effort, but I suspect for
very full file systems, it might very well be. This is a more
general case of the problem where most people only benchmark mostly
empty file systems, and my experience has been that above 70-80%
utilization, our performance starts to fall off. And while disk space
is cheap, it's not _that_ cheap, and there are always customers who
insist on using file systems up to a utilization of 99%, and expect
the same performance as when the file system was freshly formatted. :-(

					- Ted
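A rough sketch of the batching idea described above, assuming a flat array
holding the on-disk block number of each group's block bitmap. The names
adjacent_bitmap_run, bitmap_blk, loaded, and max_batch are hypothetical and
are not ext4 interfaces. The function only decides how many bitmaps could be
fetched in one request starting at group n; it does no I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Starting at group n, count how many consecutive groups have bitmaps that
 * have not been read yet and that sit in physically adjacent blocks.
 * The result is at least 1 (group n itself) and at most max_batch.
 */
size_t adjacent_bitmap_run(const uint64_t *bitmap_blk, const bool *loaded,
			   size_t ngroups, size_t n, size_t max_batch)
{
	size_t count = 1;

	assert(bitmap_blk != NULL && loaded != NULL);
	assert(n < ngroups && max_batch >= 1);

	while (n + count < ngroups &&
	       count < max_batch &&
	       !loaded[n + count] &&
	       bitmap_blk[n + count] == bitmap_blk[n + count - 1] + 1)
		count++;

	return count;
}
```

With flex_bg the bitmaps of neighbouring groups are laid out together, so in
this model the run would typically be limited by max_batch or by a group
whose bitmap is already cached, rather than by a gap on disk.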
* Re: [PATCH] ext4: don't load the block bitmap for block groups which have no space
From: Eric Sandeen @ 2012-08-13 18:51 UTC
To: Theodore Ts'o; +Cc: Ext4 Developers List

On 8/13/12 1:49 PM, Theodore Ts'o wrote:
> On Mon, Aug 13, 2012 at 11:02:08AM -0500, Eric Sandeen wrote:
>>
>> Looks ok to me; I think this just further optimizes what was done
>> in
>>
>> 8a57d9d61a6e361c7bb159dda797672c1df1a691
>> ext4: check for a good block group before loading buddy pages
>>
>> correct?
>
> Yes, that's right; it's a further optimization.
>
> I can think of an additional optimization where if we are reading the
> block bitmap for block group N, and the block bitmap for block group
> N+1 hasn't been read before (so we don't have buddy bitmap stats), and
> the block bitmap for bg N+1 is adjacent to that of bg N, we should read
> both at the same time. (And this could be generalized for N+2, N+3, etc.)
>
> I'm not entirely sure whether it's worth the effort, but I suspect for
> very full file systems, it might very well be. This is a more
> general case of the problem where most people only benchmark mostly
> empty file systems, and my experience has been that above 70-80%
> utilization, our performance starts to fall off. And while disk space
> is cheap, it's not _that_ cheap, and there are always customers who
> insist on using file systems up to a utilization of 99%, and expect
> the same performance as when the file system was freshly formatted. :-(

I did some tests w/ very large filesystems, fallocating 1T at a time until
full. ext4 tended to fall down pretty badly towards the end. Anything
that can reduce the time it takes to find free blocks as a very large
filesystem fills would probably be useful....

-eric

> - Ted
>
* Re: [PATCH] ext4: don't load the block bitmap for block groups which have no space
From: Andreas Dilger @ 2012-08-13 23:20 UTC
To: Theodore Ts'o; +Cc: Eric Sandeen, Ext4 Developers List

On 2012-08-13, at 12:49 PM, Theodore Ts'o wrote:
> On Mon, Aug 13, 2012 at 11:02:08AM -0500, Eric Sandeen wrote:
>>
>> Looks ok to me; I think this just further optimizes what was done
>> in
>>
>> 8a57d9d61a6e361c7bb159dda797672c1df1a691
>> ext4: check for a good block group before loading buddy pages
>>
>> correct?
>
> Yes, that's right; it's a further optimization.
>
> I can think of an additional optimization where if we are reading the
> block bitmap for block group N, and the block bitmap for block group
> N+1 hasn't been read before (so we don't have buddy bitmap stats), and
> the block bitmap for bg N+1 is adjacent to that of bg N, we should read
> both at the same time. (And this could be generalized for N+2, N+3, etc.)

I was thinking the same thing. Seems a shame that we have contiguous
bitmaps with flex_bg and don't load them all at once. However, I ended
up deciding not to pursue the issue, because I suspect the block device
will already be doing some physical block/track readahead.

I guess it couldn't hurt to submit explicit readahead requests, so long
as we don't wait for anything but the first bitmap to actually be loaded.

> I'm not entirely sure whether it's worth the effort, but I suspect for
> very full file systems, it might very well be. This is a more
> general case of the problem where most people only benchmark mostly
> empty file systems, and my experience has been that above 70-80%
> utilization, our performance starts to fall off. And while disk space
> is cheap, it's not _that_ cheap, and there are always customers who
> insist on using file systems up to a utilization of 99%, and expect
> the same performance as when the file system was freshly formatted. :-(

In my experience, there are so many factors that affect the performance
of a full filesystem that nothing can be done about it. We've discussed
changing statfs() reporting for Lustre to exclude the "reserved" amount
from the device size, so that people don't complain "why can't I use the
last 5% of the device" and/or run "tune2fs -m 0" to remove the reserved
space, then complain when performance permanently dives after hitting
100% full due to bad fragmentation of the last 5% of files written that
will not be deleted for many months.

Even with SSDs, the fragmentation is going to be seen, due to erase
block fragmentation and more IO submission overhead for small chunks.

The other significant factor is that inner/outer track performance can
vary by a factor of 2x on some drives. The ext4 allocator biases toward
outer tracks, which is good, but performance is down on the inner tracks
regardless of whether there is fragmentation or not.

Cheers, Andreas
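For anyone who wants to see where a given file system sits relative to the
utilization levels discussed in this thread, the numbers are available from
user space through POSIX statvfs(). This is a minimal sketch with
hypothetical helper names (usage_from_counts, usage_of_path). It treats
f_bfree minus f_bavail as the reserved amount, which is only an
approximation of what tune2fs -m controls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/statvfs.h>

struct fs_usage {
	double used_pct;     /* blocks in use, as a percentage of all blocks */
	double reserved_pct; /* free but not available to unprivileged users */
};

/* Pure arithmetic on block counts, kept separate so it is easy to check. */
struct fs_usage usage_from_counts(uint64_t blocks, uint64_t bfree,
				  uint64_t bavail)
{
	struct fs_usage u = {0.0, 0.0};

	if (blocks == 0)
		return u;

	/* Clamp rather than trust every file system to report sane counts. */
	uint64_t used = blocks > bfree ? blocks - bfree : 0;
	uint64_t reserved = bfree > bavail ? bfree - bavail : 0;

	u.used_pct = 100.0 * (double)used / (double)blocks;
	u.reserved_pct = 100.0 * (double)reserved / (double)blocks;
	return u;
}

/* Fills *out for the file system containing path; returns 0 on success. */
int usage_of_path(const char *path, struct fs_usage *out)
{
	struct statvfs sv;

	assert(path != NULL && out != NULL);

	if (statvfs(path, &sv) != 0)
		return -1;

	*out = usage_from_counts((uint64_t)sv.f_blocks, (uint64_t)sv.f_bfree,
				 (uint64_t)sv.f_bavail);
	return 0;
}
```

A monitoring script could call usage_of_path() on the mount point and warn
once used_pct passes whatever threshold the workload tolerates, instead of
discovering the slowdown at 99%.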
* [Bug 45741] ext4 scans all disk when calling fallocate after mount on 99% full volume.
From: bugzilla-daemon @ 2012-10-15 21:24 UTC
To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=45741

Florian Mickler <florian@mickler.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |florian@mickler.org

--- Comment #2 from Florian Mickler <florian@mickler.org> 2012-10-15 21:24:57 ---
A patch referencing this bug report has been merged in Linux v3.7-rc1:

commit 01fc48e8929e45e67527200017cff4e74e4ba054
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Fri Aug 17 09:46:17 2012 -0400

    ext4: don't load the block bitmap for block groups which have no space
* [Bug 45741] ext4 scans all disk when calling fallocate after mount on 99% full volume.
From: bugzilla-daemon @ 2012-11-08 14:21 UTC
To: linux-ext4

https://bugzilla.kernel.org/show_bug.cgi?id=45741

Alan <alan@lxorguk.ukuu.org.uk> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |alan@lxorguk.ukuu.org.uk
         Resolution|                            |CODE_FIX