* ext4: first write to large ext3 filesystem takes 96 seconds
From: Benjamin LaHaise @ 2014-07-07 21:13 UTC
To: linux-ext4

Hi folks,

I've just run into a bug with the ext4 codebase in 3.4.91 that doesn't seem
to exist in ext3, and was wondering if anyone has encountered this before.
I have a 7.4TB ext3 filesystem that has been filled with 1.8TB of data.
When this filesystem is freshly mounted, the first write to the filesystem
takes a whopping 96 seconds to complete, during which time the system is
reading about 1000 blocks per second.  Subsequent writes are much quicker.
The problem seems to be that ext4 is loading all of the bitmaps on the
filesystem before the first write proceeds.  The backtrace looks roughly as
follows:

[ 4480.921288]  [<ffffffff81437472>] ? dm_request_fn+0x112/0x1c0
[ 4480.921292]  [<ffffffff81244bf5>] ? __blk_run_queue+0x15/0x20
[ 4480.921294]  [<ffffffff81246670>] ? queue_unplugged+0x20/0x50
[ 4480.921297]  [<ffffffff8154ed05>] schedule+0x45/0x60
[ 4480.921299]  [<ffffffff8154ef7c>] io_schedule+0x6c/0xb0
[ 4480.921301]  [<ffffffff810fb7a9>] sleep_on_buffer+0x9/0x10
[ 4480.921303]  [<ffffffff8154d435>] __wait_on_bit+0x55/0x80
[ 4480.921306]  [<ffffffff810fb7a0>] ? unmap_underlying_metadata+0x40/0x40
[ 4480.921308]  [<ffffffff810fb7a0>] ? unmap_underlying_metadata+0x40/0x40
[ 4480.921310]  [<ffffffff8154d4d8>] out_of_line_wait_on_bit+0x78/0x90
[ 4480.921312]  [<ffffffff8104d5b0>] ? autoremove_wake_function+0x40/0x40
[ 4480.921315]  [<ffffffff810fb756>] __wait_on_buffer+0x26/0x30
[ 4480.921318]  [<ffffffff81146258>] ext4_wait_block_bitmap+0x138/0x190
[ 4480.921321]  [<ffffffff8116c816>] ext4_mb_init_cache+0x1e6/0x5f0
[ 4480.921324]  [<ffffffff8109654a>] ? add_to_page_cache_locked+0x9a/0xd0
[ 4480.921327]  [<ffffffff810965b1>] ? add_to_page_cache_lru+0x31/0x50
[ 4480.921330]  [<ffffffff8116cd1f>] ext4_mb_init_group+0xff/0x1e0
[ 4480.921332]  [<ffffffff8116ce9f>] ext4_mb_good_group+0x9f/0x130
[ 4480.921334]  [<ffffffff8116e41f>] ext4_mb_regular_allocator+0x1bf/0x3d0
[ 4480.921337]  [<ffffffff8116c0ac>] ? ext4_mb_normalize_request+0x26c/0x4d0
[ 4480.921339]  [<ffffffff811716ce>] ext4_mb_new_blocks+0x2ee/0x490
[ 4480.921342]  [<ffffffff81174c11>] ? ext4_get_branch+0x101/0x130
[ 4480.921345]  [<ffffffff8117639c>] ext4_ind_map_blocks+0x9bc/0xc10
[ 4480.921347]  [<ffffffff810fae11>] ? __getblk+0x21/0x2b0
[ 4480.921350]  [<ffffffff8114c9a3>] ext4_map_blocks+0x293/0x390
[ 4480.921353]  [<ffffffff8117a462>] ? do_get_write_access+0x1d2/0x450
[ 4480.921355]  [<ffffffff810cb0b4>] ? kmem_cache_alloc+0xa4/0xc0
[ 4480.921358]  [<ffffffff8114d9e9>] _ext4_get_block+0xa9/0x140
[ 4480.921360]  [<ffffffff8114dab1>] ext4_get_block+0x11/0x20
[ 4480.921362]  [<ffffffff810fbda5>] __block_write_begin+0x2b5/0x470
[ 4480.921365]  [<ffffffff8114daa0>] ? noalloc_get_block_write+0x20/0x20
[ 4480.921368]  [<ffffffff81096679>] ? grab_cache_page_write_begin+0xa9/0x100
[ 4480.921370]  [<ffffffff8114c1e2>] ext4_write_begin+0x132/0x2f0
[ 4480.921373]  [<ffffffff81095869>] generic_file_buffered_write+0x119/0x260
[ 4480.921376]  [<ffffffff81096eef>] __generic_file_aio_write+0x27f/0x430
[ 4480.921379]  [<ffffffff810cfbba>] ? do_huge_pmd_anonymous_page+0x1ea/0x2d0
[ 4480.921382]  [<ffffffff81097101>] generic_file_aio_write+0x61/0xc0
[ 4480.921384]  [<ffffffff81147c18>] ext4_file_write+0x68/0x2a0
[ 4480.921387]  [<ffffffff8154e723>] ? __schedule+0x2c3/0x800
[ 4480.921389]  [<ffffffff810d2a41>] do_sync_write+0xe1/0x120
[ 4480.921392]  [<ffffffff8154eefa>] ? _cond_resched+0x2a/0x40
[ 4480.921395]  [<ffffffff810d31c9>] vfs_write+0xc9/0x170
[ 4480.921397]  [<ffffffff810d3910>] sys_write+0x50/0x90
[ 4480.921400]  [<ffffffff8155155f>] sysenter_dispatch+0x7/0x1a

Any thoughts?  Have there been any changes to this area of the ext4 code?

		-ben
-- 
"Thought is the essence of where you are now."
* Re: ext4: first write to large ext3 filesystem takes 96 seconds
From: Theodore Ts'o @ 2014-07-08 0:16 UTC
To: Benjamin LaHaise; +Cc: linux-ext4

On Mon, Jul 07, 2014 at 05:13:49PM -0400, Benjamin LaHaise wrote:
> Hi folks,
> 
> I've just run into a bug with the ext4 codebase in 3.4.91 that doesn't seem
> to exist in ext3, and was wondering if anyone has encountered this before.
> I have a 7.4TB ext3 filesystem that has been filled with 1.8TB of data.
> When this filesystem is freshly mounted, the first write to the filesystem
> takes a whopping 96 seconds to complete, during which time the system is
> reading about 1000 blocks per second.  Subsequent writes are much quicker.
> The problem seems to be that ext4 is loading all of the bitmaps on the
> filesystem before the first write proceeds.  The backtrace looks roughly as
> follows:

So the issue is that ext3 will just allocate the first free block it
can find, even if it is a single free block in block group #1001,
followed by a single free block in block group #2002.  Ext4 tries
harder to find contiguous blocks.

If you are using an ext3 file system format, the block allocation
bitmaps are scattered across the entire file system, so we end up
doing a lot of random 4k seeks.

We can try to be a bit smarter about how we try to search the file
system for free blocks.

Out of curiosity, can you send me a copy of the contents of:

	/proc/fs/ext4/dm-XX/mb_groups

Thanks!!

					- Ted
* Re: ext4: first write to large ext3 filesystem takes 96 seconds
From: Benjamin LaHaise @ 2014-07-08 1:35 UTC
To: Theodore Ts'o; +Cc: linux-ext4

Hi Ted,

On Mon, Jul 07, 2014 at 08:16:55PM -0400, Theodore Ts'o wrote:
> On Mon, Jul 07, 2014 at 05:13:49PM -0400, Benjamin LaHaise wrote:
> > Hi folks,
> > 
> > I've just run into a bug with the ext4 codebase in 3.4.91 that doesn't seem
> > to exist in ext3, and was wondering if anyone has encountered this before.
> > I have a 7.4TB ext3 filesystem that has been filled with 1.8TB of data.
> > When this filesystem is freshly mounted, the first write to the filesystem
> > takes a whopping 96 seconds to complete, during which time the system is
> > reading about 1000 blocks per second.  Subsequent writes are much quicker.
> > The problem seems to be that ext4 is loading all of the bitmaps on the
> > filesystem before the first write proceeds.  The backtrace looks roughly as
> > follows:
> 
> So the issue is that ext3 will just allocate the first free block it
> can find, even if it is a single free block in block group #1001,
> followed by a single free block in block group #2002.  Ext4 tries
> harder to find contiguous blocks.
> 
> If you are using an ext3 file system format, the block allocation
> bitmaps are scattered across the entire file system, so we end up
> doing a lot of random 4k seeks.

Yeah, we're kinda stuck with ext3 on disk for now due to a bunch of
reasons.  The main reason for using the ext4 codebase instead of ext3 has
mostly to do with slightly better performance for some metadata intensive
operations (like unlink and sync writes).

> We can try to be a bit smarter about how we try to search the file
> system for free blocks.
> 
> Out of curiosity, can you send me a copy of the contents of:
> 
> 	/proc/fs/ext4/dm-XX/mb_groups

Sure -- I put a copy at http://www.kvack.org/~bcrl/mb_groups as it's a bit
too big for the mailing list.  The filesystem in question has a couple of
11GB files on it, with the remainder of the space being taken up by files
7200016 bytes in size.

Cheers,

		-ben

> Thanks!!
> 
> 	- Ted
-- 
"Thought is the essence of where you are now."
* Re: ext4: first write to large ext3 filesystem takes 96 seconds
From: Theodore Ts'o @ 2014-07-08 3:54 UTC
To: Benjamin LaHaise; +Cc: linux-ext4

On Mon, Jul 07, 2014 at 09:35:11PM -0400, Benjamin LaHaise wrote:
> 
> Sure -- I put a copy at http://www.kvack.org/~bcrl/mb_groups as it's a bit
> too big for the mailing list.  The filesystem in question has a couple of
> 11GB files on it, with the remainder of the space being taken up by files
> 7200016 bytes in size.

Right, so looking at mb_groups we see a bunch of the problems.  There
are a large number of block groups which look like this:

#group: free frags first [ 2^0 2^1 2^2 2^3 2^4 2^5 2^6 2^7 2^8 2^9 2^10 2^11 2^12 2^13 ]
#288  : 1540 7     13056 [ 0   0   1   0   0   0   0   0   6   0   0    0    0    0    ]

It would be very interesting to see what allocation pattern resulted
in so many block groups with this layout.  Before we read in the
allocation bitmap, all we know from the block group descriptors is
that there are 1540 free blocks.  What we don't know is that they are
broken up into six 256-block free regions, plus a 4-block region.

If we try to allocate a 1024 block region, we'll end up searching a
large number of these block groups before finding one which is suitable.

Or there is a large collection of block groups that look like this:

#834  : 4900 39    514   [ 0   20  5   5   16  6   4   8   6   1   1    0    0    0    ]

Similarly, we could try to look for a contiguous 2048 block range, but
even though there are 4900 blocks available, we can't tell the difference
between a free block layout which looks like the above, versus one that
looks like this:

#834  : 4900 39    514   [ 0   6   0   1   3   5   1   4   0   0   0    2    0    0    ]

We could try going straight for the largely empty block groups, but
that's more likely to fragment the file system more quickly, and then
once those largely empty block groups are partially used, then we'll
end up taking a long time while we scan all of the block groups.

					- Ted
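To make that distinction concrete, here is a minimal user-space sketch in C
that reads an mb_groups file in the format quoted above and compares what
the per-group free counts alone would suggest against what the buddy
histograms actually contain for a given allocation order.  The default
path, the 14-order histogram width and the parsing are assumptions taken
from the sample output in this thread, not a stable kernel interface.

/*
 * mbscan.c: count block groups that could satisfy an allocation of
 * 2^order blocks, using the mb_groups buddy histogram, and compare that
 * against what the raw free-block count alone would suggest.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MB_ORDERS 14	/* 2^0 .. 2^13, as in the sample rows above */

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/proc/fs/ext4/dm-2/mb_groups";
	int want_order = argc > 2 ? atoi(argv[2]) : 8;	/* 2^8 blocks = 1MB at 4k */
	unsigned long groups = 0, by_free_count = 0, by_buddy = 0;
	char line[512];
	FILE *f = fopen(path, "r");

	if (!f) { perror(path); return 1; }
	while (fgets(line, sizeof(line), f)) {
		unsigned group, free_blocks, frags, first, buckets[MB_ORDERS];
		char *p = strchr(line, '[');
		int i;

		/* skip the "#group: free frags first [...]" header line */
		if (line[0] != '#' || !strncmp(line, "#group", 6) || !p)
			continue;
		if (sscanf(line, "#%u : %u %u %u",
			   &group, &free_blocks, &frags, &first) != 4)
			continue;
		for (i = 0, p++; i < MB_ORDERS; i++)
			buckets[i] = strtoul(p, &p, 10);

		groups++;
		if (free_blocks >= (1U << want_order))
			by_free_count++;	/* what the group descriptor suggests */
		for (i = want_order; i < MB_ORDERS; i++)
			if (buckets[i]) { by_buddy++; break; }
	}
	fclose(f);

	printf("%lu groups scanned\n", groups);
	printf("%lu have >= 2^%d free blocks in total\n", by_free_count, want_order);
	printf("%lu actually contain a free extent of order >= %d\n",
	       by_buddy, want_order);
	return 0;
}

Run against a dump like the one above with order 8 (256 blocks), groups
like #288 still qualify; with order 11 (2048 blocks), they drop out even
though their total free counts look healthy, which is exactly the
information the group descriptors alone cannot provide.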
* Re: ext4: first write to large ext3 filesystem takes 96 seconds
From: Benjamin LaHaise @ 2014-07-08 14:53 UTC
To: Theodore Ts'o; +Cc: linux-ext4

On Mon, Jul 07, 2014 at 11:54:05PM -0400, Theodore Ts'o wrote:
> On Mon, Jul 07, 2014 at 09:35:11PM -0400, Benjamin LaHaise wrote:
> > 
> > Sure -- I put a copy at http://www.kvack.org/~bcrl/mb_groups as it's a bit
> > too big for the mailing list.  The filesystem in question has a couple of
> > 11GB files on it, with the remainder of the space being taken up by files
> > 7200016 bytes in size.
> 
> Right, so looking at mb_groups we see a bunch of the problems.  There
> are a large number of block groups which look like this:
> 
> #group: free frags first [ 2^0 2^1 2^2 2^3 2^4 2^5 2^6 2^7 2^8 2^9 2^10 2^11 2^12 2^13 ]
> #288  : 1540 7     13056 [ 0   0   1   0   0   0   0   0   6   0   0    0    0    0    ]
> 
> It would be very interesting to see what allocation pattern resulted
> in so many block groups with this layout.  Before we read in the
> allocation bitmap, all we know from the block group descriptors is
> that there are 1540 free blocks.  What we don't know is that they are
> broken up into six 256-block free regions, plus a 4-block region.

I did have to make a change to the ext4 inode allocator to bias things
towards allocating inodes at the beginning of the disk (see below).
Without that change the allocation pattern of writes to the filesystem
resulted in a significant performance regression relative to ext3, owing
mostly to the fact that fallocate() on ext4 is unimplemented for indirect
style metadata.  (Note that we mount the filesystem with the noorlov mount
option added by that patch.)  With that change, the workload essentially
consists of writing 7200016-byte files, each in one write() operation,
rotating between 100 subdirectories off the root of the filesystem.

> If we try to allocate a 1024 block region, we'll end up searching a
> large number of these block groups before finding one which is suitable.
> 
> Or there is a large collection of block groups that look like this:
> 
> #834  : 4900 39    514   [ 0   20  5   5   16  6   4   8   6   1   1    0    0    0    ]
> 
> Similarly, we could try to look for a contiguous 2048 block range, but
> even though there are 4900 blocks available, we can't tell the difference
> between a free block layout which looks like the above, versus one that
> looks like this:
> 
> #834  : 4900 39    514   [ 0   6   0   1   3   5   1   4   0   0   0    2    0    0    ]
> 
> We could try going straight for the largely empty block groups, but
> that's more likely to fragment the file system more quickly, and then
> once those largely empty block groups are partially used, then we'll
> end up taking a long time while we scan all of the block groups.

Fragmentation is not a significant concern for the workload in question.
Write performance is much more important to us than read performance, and
read performance tends to degrade to random reads owing to the fact that
the system can have many queues (~16k) issuing reads.  Hence, getting the
block allocator to place writes as close to sequentially on disk as
possible is an important corner of performance.  Ext4 with indirect blocks
has a tendency to leave gaps between files, which degrades performance for
this workload, since files tend not to be packed as closely together as
they were with ext3.
Ext4 with extents + fallocate() packs files on disk without any gaps, but
turning on extents is not an option (unfortunately, as a 20+ minute fsck
time / outage as part of an upgrade is not viable).

		-ben

> 	- Ted
> 
-- 
"Thought is the essence of where you are now."

diff -pu ./fs/ext4/ext4.h /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/ext4.h
--- ./fs/ext4/ext4.h	2014-03-12 16:32:21.077386952 -0400
+++ /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/ext4.h	2014-07-03 14:05:14.000000000 -0400
@@ -962,6 +962,7 @@ struct ext4_inode_info {
 
 #define EXT4_MOUNT2_EXPLICIT_DELALLOC	0x00000001 /* User explicitly
						      specified delalloc */
+#define EXT4_MOUNT2_NO_ORLOV		0x00000002 /* Disable orlov */
 
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
						~EXT4_MOUNT_##opt
diff -pu ./fs/ext4/ialloc.c /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/ialloc.c
--- ./fs/ext4/ialloc.c	2014-03-12 16:32:21.078386958 -0400
+++ /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/ialloc.c	2014-05-26 14:22:23.000000000 -0400
@@ -517,6 +517,9 @@ static int find_group_other(struct super
 	struct ext4_group_desc *desc;
 	int flex_size = ext4_flex_bg_size(EXT4_SB(sb));
 
+	if (test_opt2(sb, NO_ORLOV))
+		goto do_linear;
+
 	/*
 	 * Try to place the inode is the same flex group as its
 	 * parent.  If we can't find space, use the Orlov algorithm to
@@ -589,6 +592,7 @@ static int find_group_other(struct super
 		return 0;
 	}
 
+do_linear:
 	/*
 	 * That failed: try linear search for a free inode, even if that group
 	 * has no free blocks.
@@ -655,7 +659,7 @@ struct inode *ext4_new_inode(handle_t *h
 		goto got_group;
 	}
 
-	if (S_ISDIR(mode))
+	if (!test_opt2(sb, NO_ORLOV) && S_ISDIR(mode))
 		ret2 = find_group_orlov(sb, dir, &group, mode, qstr);
 	else
 		ret2 = find_group_other(sb, dir, &group, mode);
diff -pu ./fs/ext4/super.c /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/super.c
--- ./fs/ext4/super.c	2014-03-12 16:32:21.080386971 -0400
+++ /opt/cvsdirs/blahaise/kernel34-d35/fs/ext4/super.c	2014-05-26 14:22:23.000000000 -0400
@@ -1191,6 +1201,7 @@ enum {
 	Opt_inode_readahead_blks, Opt_journal_ioprio,
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
+	Opt_noorlov
 };
 
 static const match_table_t tokens = {
@@ -1210,6 +1221,7 @@ static const match_table_t tokens = {
 	{Opt_debug, "debug"},
 	{Opt_removed, "oldalloc"},
 	{Opt_removed, "orlov"},
+	{Opt_noorlov, "noorlov"},
 	{Opt_user_xattr, "user_xattr"},
 	{Opt_nouser_xattr, "nouser_xattr"},
 	{Opt_acl, "acl"},
@@ -1376,6 +1388,7 @@ static const struct mount_opts {
 	int	token;
 	int	mount_opt;
 	int	flags;
+	int	mount_opt2;
 } ext4_mount_opts[] = {
 	{Opt_minix_df, EXT4_MOUNT_MINIX_DF, MOPT_SET},
 	{Opt_bsd_df, EXT4_MOUNT_MINIX_DF, MOPT_CLEAR},
@@ -1444,6 +1457,7 @@ static const struct mount_opts {
 	{Opt_jqfmt_vfsold, QFMT_VFS_OLD, MOPT_QFMT},
 	{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
 	{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
+	{Opt_noorlov, 0, MOPT_SET, EXT4_MOUNT2_NO_ORLOV},
 	{Opt_err, 0, 0}
 };
 
@@ -1562,6 +1576,7 @@ static int handle_mount_opt(struct super
 		} else {
 			clear_opt(sb, DATA_FLAGS);
 			sbi->s_mount_opt |= m->mount_opt;
+			sbi->s_mount_opt2 |= m->mount_opt2;
 		}
 #ifdef CONFIG_QUOTA
 	} else if (m->flags & MOPT_QFMT) {
@@ -1585,10 +1600,13 @@ static int handle_mount_opt(struct super
 			WARN_ON(1);
 			return -1;
 		}
-		if (arg != 0)
+		if (arg != 0) {
 			sbi->s_mount_opt |= m->mount_opt;
-		else
+			sbi->s_mount_opt2 |= m->mount_opt2;
+		} else {
 			sbi->s_mount_opt &= ~m->mount_opt;
+			sbi->s_mount_opt2 &= ~m->mount_opt2;
+		}
 	}
 	return 1;
 }
@@ -1736,11 +1754,15 @@ static int _ext4_show_options(struct seq
 		if (((m->flags & (MOPT_SET|MOPT_CLEAR)) == 0) ||
 		    (m->flags & MOPT_CLEAR_ERR))
 			continue;
-		if (!(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)))
+		if (!(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)) &&
+		    !(m->mount_opt2 & sbi->s_mount_opt2))
 			continue; /* skip if same as the default */
-		if ((want_set &&
-		     (sbi->s_mount_opt & m->mount_opt) != m->mount_opt) ||
-		    (!want_set && (sbi->s_mount_opt & m->mount_opt)))
+		if (want_set &&
+		    (((sbi->s_mount_opt & m->mount_opt) != m->mount_opt) ||
+		     ((sbi->s_mount_opt2 & m->mount_opt2) != m->mount_opt2)))
+			continue; /* select Opt_noFoo vs Opt_Foo */
+		if (!want_set && ((sbi->s_mount_opt & m->mount_opt) ||
+				  (sbi->s_mount_opt2 & m->mount_opt2)))
 			continue; /* select Opt_noFoo vs Opt_Foo */
 		SEQ_OPTS_PRINT("%s", token2str(m->token));
 	}
* Re: ext4: first write to large ext3 filesystem takes 96 seconds
From: Andreas Dilger @ 2014-07-08 5:11 UTC
To: Theodore Ts'o; +Cc: Benjamin LaHaise, linux-ext4@vger.kernel.org

The main problem here is that reading all of the block bitmaps takes
a huge amount of time for a large filesystem.

7.8TB / 128MB/group ~= 8000 groups
8000 bitmaps / 100 seeks/sec = 80s

So that is what is making things slow.  Once the allocator has all the
blocks in memory there are no problems.  There are some heuristics
to skip bitmaps that are totally full, but they don't work in your case.

This is why the flex_bg feature was created - to allow the bitmaps
to be read from disk without seeks.  This also speeds up e2fsck by
the same 96s that would otherwise be wasted waiting for the disk.

Backporting flex_bg to ext3 would be fairly trivial - just disable the
checks for the location of the bitmaps at mount time.  However, using it
requires that you reformat your filesystem with "-O flex_bg" to get the
improved layout.

The other option (if your runtime environment allows it) is to prefetch
the block bitmaps using "dumpe2fs /dev/XXX > /dev/null" before the
filesystem is in use.  This still takes 90s, but can be started early in
the boot process on each disk in parallel.

Cheers, Andreas

> On Jul 7, 2014, at 18:16, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Mon, Jul 07, 2014 at 05:13:49PM -0400, Benjamin LaHaise wrote:
>> Hi folks,
>> 
>> I've just run into a bug with the ext4 codebase in 3.4.91 that doesn't seem
>> to exist in ext3, and was wondering if anyone has encountered this before.
>> I have a 7.4TB ext3 filesystem that has been filled with 1.8TB of data.
>> When this filesystem is freshly mounted, the first write to the filesystem
>> takes a whopping 96 seconds to complete, during which time the system is
>> reading about 1000 blocks per second.  Subsequent writes are much quicker.
>> The problem seems to be that ext4 is loading all of the bitmaps on the
>> filesystem before the first write proceeds.  The backtrace looks roughly as
>> follows:
> 
> So the issue is that ext3 will just allocate the first free block it
> can find, even if it is a single free block in block group #1001,
> followed by a single free block in block group #2002.  Ext4 tries
> harder to find contiguous blocks.
> 
> If you are using an ext3 file system format, the block allocation
> bitmaps are scattered across the entire file system, so we end up
> doing a lot of random 4k seeks.
> 
> We can try to be a bit smarter about how we try to search the file
> system for free blocks.
> 
> Out of curiosity, can you send me a copy of the contents of:
> 
> 	/proc/fs/ext4/dm-XX/mb_groups
> 
> Thanks!!
> 
> 	- Ted
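As a rough illustration of the prefetch idea above, a small C wrapper
could start one dumpe2fs per device in parallel with its output discarded,
so the bitmap reads for each disk overlap instead of serializing behind
the first write.  The device list and the pared-down error handling are
purely illustrative; a shell loop would do the same job.

/* prefetch.c: run "dumpe2fs <dev> > /dev/null" for each argument in parallel */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int i;

	for (i = 1; i < argc; i++) {		/* e.g. ./prefetch /dev/dm-1 /dev/dm-2 */
		pid_t pid = fork();

		if (pid == 0) {
			int devnull = open("/dev/null", O_WRONLY);

			if (devnull >= 0)
				dup2(devnull, STDOUT_FILENO);	/* discard the listing itself */
			execlp("dumpe2fs", "dumpe2fs", argv[i], (char *)NULL);
			perror("execlp");
			_exit(127);
		}
	}
	while (wait(NULL) > 0)			/* wait for all children to finish */
		;
	return 0;
}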
* Re: ext4: first write to large ext3 filesystem takes 96 seconds
From: Benjamin LaHaise @ 2014-07-30 14:49 UTC
To: Andreas Dilger; +Cc: Theodore Ts'o, linux-ext4@vger.kernel.org

Hi Andreas, Ted,

I've finally had some more time to dig into this problem, and it's worse
than I initially thought in that it occurs on normal ext4 filesystems.

On Mon, Jul 07, 2014 at 11:11:58PM -0600, Andreas Dilger wrote:
...
> The main problem here is that reading all of the block bitmaps takes
> a huge amount of time for a large filesystem.

Very true.

...
> 
> 7.8TB / 128MB/group ~= 8000 groups
> 8000 bitmaps / 100 seeks/sec = 80s
> 
> So that is what is making things slow.  Once the allocator has all the
> blocks in memory there are no problems.  There are some heuristics
> to skip bitmaps that are totally full, but they don't work in your case.
> 
> This is why the flex_bg feature was created - to allow the bitmaps
> to be read from disk without seeks.  This also speeds up e2fsck by
> the same 96s that would otherwise be wasted waiting for the disk.

Unfortunately, that isn't the case.

> Backporting flex_bg to ext3 would be fairly trivial - just disable the
> checks for the location of the bitmaps at mount time.  However, using it
> requires that you reformat your filesystem with "-O flex_bg" to get the
> improved layout.

flex_bg is not sufficient to resolve this issue.  Using a native ext4
formatted filesystem initialized with mke4fs 1.41.12, this problem still
occurs.  I created a 7.1TB filesystem and filled it to about 92% full with
8MB files.  The time to create a new 8MB file after a fresh mount ranges
from 0.017 seconds to 13.2 seconds.  The outliers correlate with bitmaps
being read from disk.  A copy of /proc/fs/ext4/dm-2/mb_groups from this
92% full fs is available at http://www.kvack.org/~bcrl/mb_groups.ext4-92

Note that it isn't the first allocating write to the filesystem that is
the worst in terms of timing; it can end up being the 10th or even the
100th attempt.

> The other option (if your runtime environment allows it) is to prefetch
> the block bitmaps using "dumpe2fs /dev/XXX > /dev/null" before the
> filesystem is in use.  This still takes 90s, but can be started early in
> the boot process on each disk in parallel.

That isn't a solution.  Prefetching is impossible in my particular
use-case, as the filesystem is being mounted after a failover from another
node -- any data prefetched prior to switching active nodes is not
guaranteed to be valid.

This seems like a pretty serious regression relative to ext3.  Why can't
ext4's mballoc pick better block groups to attempt allocating from based
on the free block counts in the block group summaries?

		-ben
-- 
"Thought is the essence of where you are now."
* Re: ext4: first write to large ext3 filesystem takes 96 seconds
From: Theodore Ts'o @ 2014-07-31 13:03 UTC
To: Benjamin LaHaise; +Cc: Andreas Dilger, linux-ext4@vger.kernel.org

On Wed, Jul 30, 2014 at 10:49:28AM -0400, Benjamin LaHaise wrote:
> This seems like a pretty serious regression relative to ext3.  Why can't
> ext4's mballoc pick better block groups to attempt allocating from based
> on the free block counts in the block group summaries?

Allocation algorithms are *always* tradeoffs.  So I don't think
regression is necessarily the best way to think about things.
Unfortunately, your use case really doesn't work well with how we have
set things up with ext4 now.  Sure, if your specific use case is
one where you are mostly allocating 8MB files, then we can add a
special case where if you are allocating 32768 blocks, we should
search for block groups that have 32768 blocks free.  And if that's
what you are asking for, we can certainly do that.

The problem is that free block counts don't work well in general.  If
I see that the free block count is 2048 blocks, that doesn't tell me
whether the free blocks are in a contiguous single chunk of 2048 blocks,
or 2048 single block items.  (We do actually pay attention to free
blocks, by the way, but it's in a nuanced way.)

If the only goal you have is fast block allocation after fail over,
you can always use the VFAT block allocation --- i.e., use the first
free block in the file system.  Unfortunately, it will result in a
very badly fragmented file system, as Microsoft and its users
discovered.

I'm sure there are things we could do that would make things better for
your workload (if you want to tell us in great detail exactly what the
file/block allocation patterns are for your workload), and perhaps
even better in general, but the challenge is making sure we don't
regress for other workloads --- and this includes long-term
fragmentation resistance.  This is a hard problem.  Kvetching about
how it's so horrible just for you isn't really helpful for solving it.

(BTW, one of the problems is that ext4_mb_normalize_request caps large
allocations so that we use the same goal length for multiple passes as
we search for good block groups.  We might want to use the original
goal length --- so long as it is less than 32768 blocks --- for the
first scan, or at least for goal lengths which are powers of two.  So
if your application is regularly allocating files which are exactly
8MB, there are probably some optimizations that we could apply.  But
if they aren't exactly 8MB, life gets a bit trickier.)

Regards,

					- Ted
* Re: ext4: first write to large ext3 filesystem takes 96 seconds
From: Benjamin LaHaise @ 2014-07-31 14:04 UTC
To: Theodore Ts'o; +Cc: Andreas Dilger, linux-ext4@vger.kernel.org

On Thu, Jul 31, 2014 at 09:03:32AM -0400, Theodore Ts'o wrote:
> On Wed, Jul 30, 2014 at 10:49:28AM -0400, Benjamin LaHaise wrote:
> > This seems like a pretty serious regression relative to ext3.  Why can't
> > ext4's mballoc pick better block groups to attempt allocating from based
> > on the free block counts in the block group summaries?
> 
> Allocation algorithms are *always* tradeoffs.  So I don't think
> regression is necessarily the best way to think about things.
> Unfortunately, your use case really doesn't work well with how we have
> set things up with ext4 now.  Sure, if your specific use case is
> one where you are mostly allocating 8MB files, then we can add a
> special case where if you are allocating 32768 blocks, we should
> search for block groups that have 32768 blocks free.  And if that's
> what you are asking for, we can certainly do that.

The workload targets allocating 8MB files, mostly because that is a size
that is large enough to perform fairly decently, but small enough to not
incur too much latency for each write.  Depending on other dynamics in
the system, it's possible to end up with files as small as 8K, or as
large as 30MB.  The target file size can certainly be tuned up or down if
that makes life easier for the filesystem.

> The problem is that free block counts don't work well in general.  If
> I see that the free block count is 2048 blocks, that doesn't tell me
> whether the free blocks are in a contiguous single chunk of 2048 blocks,
> or 2048 single block items.  (We do actually pay attention to free
> blocks, by the way, but it's in a nuanced way.)
> 
> If the only goal you have is fast block allocation after fail over,
> you can always use the VFAT block allocation --- i.e., use the first
> free block in the file system.  Unfortunately, it will result in a
> very badly fragmented file system, as Microsoft and its users
> discovered.

Fragmentation is not a huge concern, and is more acceptable to us than an
increase in the time to perform an allocation.  Time to perform a write is
hugely important, as the system will have more and more data coming in as
time progresses.  At present under load the system has to be able to
sustain 550MB/s of writes to disk for an extended period of time.  With
8MB writes that means we can't tolerate very many multi second writes.  I
am of the opinion that expecting the filesystem to be able to sustain
550MB/s is reasonable given that the underlying disk array can perform
sequential reads/writes at more than 1GB/s and has a reasonably large
amount of write back cache (512MB) on the RAID controller.

The use-case is essentially making use of the filesystem as an elastic
buffer for queues of messages.  Under normal conditions all of the data
is received and then sent out within a fairly short period of time, but
sometimes there are receivers that are slow or offline, which means that
the in memory buffers get filled and need to be spilled out to disk.
Many users of the system cycle this behaviour over the course of a single
day.  They receive a lot of data during business hours, then process and
drain it over the course of the evening.
Since everything is cyclic, and reads are slow anyways, long term
fragmentation of the filesystem isn't a significant concern.

> I'm sure there are things we could do that would make things better for
> your workload (if you want to tell us in great detail exactly what the
> file/block allocation patterns are for your workload), and perhaps
> even better in general, but the challenge is making sure we don't
> regress for other workloads --- and this includes long-term
> fragmentation resistance.  This is a hard problem.  Kvetching about
> how it's so horrible just for you isn't really helpful for solving it.

I'm kvetching mostly because the mballoc code is hugely complicated and
easy to break (and oh have I broken it).  If you can point me in the
right direction for possible improvements that you think might improve
mballoc, I'll certainly give them a try.  Hopefully the above
descriptions of the workload make it a bit easier to understand what's
going on in the big picture.

I also don't think this problem is limited to my particular use-case.
Any ext4 filesystem that is 7TB or more and gets up into the 80-90%
utilization range will probably start exhibiting this problem.  I do
wonder if it is at all possible to fix this issue without replacing the
bitmaps used to track free space with something better suited to the task
on such large filesystems.  Pulling in hundreds of megabytes of bitmap
blocks is always going to hurt.  Fixing that would mean either
compressing the bitmaps into something that can be read more quickly, or
wholesale replacement of the bitmaps with something else.

> (BTW, one of the problems is that ext4_mb_normalize_request caps large
> allocations so that we use the same goal length for multiple passes as
> we search for good block groups.  We might want to use the original
> goal length --- so long as it is less than 32768 blocks --- for the
> first scan, or at least for goal lengths which are powers of two.  So
> if your application is regularly allocating files which are exactly
> 8MB, there are probably some optimizations that we could apply.  But
> if they aren't exactly 8MB, life gets a bit trickier.)

And sadly, they're not always 8MB.  If there's anything I can do on the
application side to make the filesystem's life easier, I would happily do
so, but we're already doing fallocate() and making the writes in a single
write() operation.  There's not much more I can think of that's low
hanging fruit.

Cheers,

		-ben

> Regards,
> 
> 	- Ted
-- 
"Thought is the essence of where you are now."
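For reference, the application-side pattern described in this thread
(preallocate with fallocate(), then hand the whole file to a single
write(), rotating across 100 subdirectories) could be sketched roughly as
follows.  The mount point, file names, file count, fill pattern and the
trailing fsync() are illustrative assumptions, not the actual application,
and the subdirectories are assumed to already exist.

/* writepattern.c: fallocate() + one write() per file, rotating across dirs */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE 7200016	/* typical message-file size from this thread */
#define NUM_DIRS  100

int main(void)
{
	static char buf[FILE_SIZE];
	char path[128];
	unsigned long seq;

	memset(buf, 0xab, sizeof(buf));
	for (seq = 0; seq < 1000; seq++) {
		int fd;

		/* rotate between d00 .. d99 off the filesystem root */
		snprintf(path, sizeof(path), "/mnt/queue/d%02lu/msg-%08lu",
			 seq % NUM_DIRS, seq);
		fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) { perror(path); return 1; }

		/* tell the allocator the final size up front; on the
		 * indirect-block filesystems discussed above this may
		 * fail with EOPNOTSUPP, so only report the error */
		if (fallocate(fd, 0, 0, FILE_SIZE) != 0)
			perror("fallocate");
		/* ... then hand the data over in a single write() */
		if (write(fd, buf, FILE_SIZE) != FILE_SIZE)
			perror("write");
		if (fsync(fd) != 0)
			perror("fsync");
		close(fd);
	}
	return 0;
}

Timing each write()/fsync() pair in a loop like this on a freshly mounted,
mostly full filesystem should be enough to reproduce the multi-second
outliers discussed earlier in the thread.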
* Re: ext4: first write to large ext3 filesystem takes 96 seconds
From: Theodore Ts'o @ 2014-07-31 15:27 UTC
To: Benjamin LaHaise; +Cc: Andreas Dilger, linux-ext4@vger.kernel.org

On Thu, Jul 31, 2014 at 10:04:34AM -0400, Benjamin LaHaise wrote:
> 
> I'm kvetching mostly because the mballoc code is hugely complicated and
> easy to break (and oh have I broken it).  If you can point me in the
> right direction for possible improvements that you think might improve
> mballoc, I'll certainly give them a try.  Hopefully the above
> descriptions of the workload make it a bit easier to understand what's
> going on in the big picture.

Yes, the mballoc code is hugely complicated.  A lot of this is because
there are a lot of special case hacks added over the years to fix various
corner cases that have shown up.  In particular, some of the magic in
normalize_request is probably there for Lustre, and it gives *me*
headaches.

One of the things which is clearly impacting you is that you need fast
failover, whereas for most of us, we're either (a) not trying to use
large file systems (whenever possible I recommend the use of single disk
file systems), or (b) we are less worried about what happens immediately
after the file system is mounted, and more about the steady state.

> I also don't think this problem is limited to my particular use-case.
> Any ext4 filesystem that is 7TB or more and gets up into the 80-90%
> utilization range will probably start exhibiting this problem.  I do
> wonder if it is at all possible to fix this issue without replacing the
> bitmaps used to track free space with something better suited to the task
> on such large filesystems.  Pulling in hundreds of megabytes of bitmap
> blocks is always going to hurt.  Fixing that would mean either
> compressing the bitmaps into something that can be read more quickly, or
> wholesale replacement of the bitmaps with something else.

Yes, it may be that the only solution, if you really want to stick with
ext4, is to architect some kind of extent-based tracking system for block
allocation.  I wouldn't be horribly against that, since the nice thing
about block allocation is that if we lose the extent tree, we can always
regenerate the information very easily as part of e2fsck pass 5.  So
moving to a tree-based allocation tracking system is much easier from a
file system recovery perspective than, say, going to a fully dynamic
inode table.  So if someone were willing to do the engineering work, I
wouldn't be opposed to having that be added to ext4.

I do have to ask, though: while I always like to see more people using
ext4, and I love to have more people contributing to ext4, have you
considered using some other file system?  It might be that something like
xfs is a much closer match to your requirements?  Or perhaps more
radically, have you considered going to some cluster file system, not
from a size perspective (7TB is very cute from a cluster fs perspective),
but from a reliability and robustness against server failure perspective.

Cheers,

					- Ted