* ext4: indirect block allocations not sequential in 3.4.67 and 3.11.7
@ 2014-01-15 19:28 Benjamin LaHaise
  2014-01-15 20:22 ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin LaHaise @ 2014-01-15 19:28 UTC
To: linux-ext4

Hi folks,

As a follow on to my previous issue with ext3, it's looking like the
indirect block allocator in ext4 is not doing a very good job of making
block allocations sequential. On a 1GB test filesystem, I'm getting
the following allocation results for 10MB files (written out with a single
10MB write()):

debugfs: stat testfile.0
Inode: 12 Type: regular Mode: 0600 Flags: 0x0 Generation: 2584871807
User: 0 Group: 0 Size: 10485760
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 20512
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x52d6de73 -- Wed Jan 15 14:16:03 2014
atime: 0x52d6de27 -- Wed Jan 15 14:14:47 2014
mtime: 0x52d6de73 -- Wed Jan 15 14:16:03 2014
BLOCKS:
(0-11):24576-24587, (IND):8797, (12-1035):24588-25611, (DIND):8798, (IND):8799,
(1036-2059):25612-26635, (IND):10248, (2060-2559):26636-27135
TOTAL: 2564

debugfs: stat testfile.1
Inode: 15 Type: regular Mode: 0600 Flags: 0x0 Generation: 1625569093
User: 0 Group: 0 Size: 10485760
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 20512
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
atime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
mtime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
BLOCKS:
(0-11):12288-12299, (IND):8787, (12-1035):12300-13323, (DIND):8790, (IND):8791,
(1036-2059):13324-14347, (IND):8789, (2060-2559):14348-14847
TOTAL: 2564

debugfs:

To give folks an idea about how significant an impact on performance this
is, using ext4 to mount my ext3 filesystem and create files is resulting
in a 10-15% reduction in speed when data is being read back into memory.
I also tested 3.11.7 and see the same poor allocation layout. I also
tried turning off delalloc, but there was no change in the layout of the
data blocks. Has anyone got any ideas what's going on here? Cheers,

                -ben
--
"Thought is the essence of where you are now."
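
To put a number on how far the mapping blocks sit from the data, the BLOCKS
line for testfile.0 can be parsed mechanically. The following is only a
rough sketch: the 4096-byte block size is inferred from the output (10485760
bytes in 2560 data blocks), and the helper names parse_blocks(),
is_contiguous() and distance_to_data() are made up for illustration and are
not part of debugfs or e2fsprogs.

```python
import re

BLOCK_SIZE = 4096  # assumption: 10485760 bytes / 2560 data blocks

# BLOCKS line for testfile.0, copied from the debugfs output above.
TESTFILE_0 = (
    "(0-11):24576-24587, (IND):8797, (12-1035):24588-25611, (DIND):8798, "
    "(IND):8799, (1036-2059):25612-26635, (IND):10248, (2060-2559):26636-27135"
)


def parse_blocks(text):
    """Split a debugfs BLOCKS line into data runs and mapping blocks."""
    data_runs, mapping = [], []
    for label, start, end in re.findall(r"\(([^)]+)\):(\d+)(?:-(\d+))?", text):
        first = int(start)
        last = int(end) if end else first
        if label in ("IND", "DIND", "TIND"):
            mapping.append((label, first))   # indirect (mapping) block
        else:
            data_runs.append((first, last))  # physical range of data blocks
    return data_runs, mapping


def is_contiguous(data_runs):
    """True if each data run starts right after the previous one ends."""
    return all(prev[1] + 1 == cur[0] for prev, cur in zip(data_runs, data_runs[1:]))


def distance_to_data(block, data_runs):
    """Number of blocks between a mapping block and the nearest data block."""
    return min(
        0 if first <= block <= last else min(abs(block - first), abs(block - last))
        for first, last in data_runs
    )


if __name__ == "__main__":
    runs, mapping = parse_blocks(TESTFILE_0)
    print("data contiguous:", is_contiguous(runs))
    for label, block in mapping:
        gap = distance_to_data(block, runs)
        mib = gap * BLOCK_SIZE / 2**20
        print(f"{label} {block}: {gap} blocks ({mib:.1f} MiB) from the data")
```

Under these assumptions the data blocks form one contiguous range, while
every mapping block lies more than 14000 blocks (over 50 MiB) below it.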
* Re: ext4: indirect block allocations not sequential in 3.4.67 and 3.11.7
  2014-01-15 19:28 ext4: indirect block allocations not sequential in 3.4.67 and 3.11.7 Benjamin LaHaise
@ 2014-01-15 20:22 ` Darrick J. Wong
  2014-01-15 20:32   ` Benjamin LaHaise
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2014-01-15 20:22 UTC
To: Benjamin LaHaise; +Cc: linux-ext4

On Wed, Jan 15, 2014 at 02:28:02PM -0500, Benjamin LaHaise wrote:
> Hi folks,
>
> As a follow on to my previous issue with ext3, it's looking like the
> indirect block allocator in ext4 is not doing a very good job of making
> block allocations sequential. On a 1GB test filesystem, I'm getting
> the following allocation results for 10MB files (written out with a single
> 10MB write()):
>
> debugfs: stat testfile.0
> Inode: 12 Type: regular Mode: 0600 Flags: 0x0 Generation: 2584871807
> User: 0 Group: 0 Size: 10485760
> File ACL: 0 Directory ACL: 0
> Links: 1 Blockcount: 20512
> Fragment: Address: 0 Number: 0 Size: 0
> ctime: 0x52d6de73 -- Wed Jan 15 14:16:03 2014
> atime: 0x52d6de27 -- Wed Jan 15 14:14:47 2014
> mtime: 0x52d6de73 -- Wed Jan 15 14:16:03 2014
> BLOCKS:
> (0-11):24576-24587, (IND):8797, (12-1035):24588-25611, (DIND):8798, (IND):8799,
> (1036-2059):25612-26635, (IND):10248, (2060-2559):26636-27135
> TOTAL: 2564

A dumpe2fs would be nice, but I think I have enough here to speculate:

The data blocks are all sequential, which looks like what one would expect from
mballoc. Is your complaint that the *IND blocks are not inline with the
data blocks, like what ext3 did?

FWIW, ext3 did something like this:
(0-11):6144-6155, (IND):6156, (12-1035):6157-7180, (DIND):7181, (IND):7182,
(1036-2059):7183-8206, (IND):8207, (2060-2559):8208-8707

I think the behavior that you're seeing is ext4 trying to keep the mapping
blocks close to the inode table to avoid fragmenting the file -- see
ext4_find_near() in indirect.c. There's an XXX comment in ext4_find_goal()
that implies that someone might have wanted to tie in with mballoc, which I
suppose you could use to restore the ext3 behavior... but there's no way to do
that.

--D

>
> debugfs: stat testfile.1
> Inode: 15 Type: regular Mode: 0600 Flags: 0x0 Generation: 1625569093
> User: 0 Group: 0 Size: 10485760
> File ACL: 0 Directory ACL: 0
> Links: 1 Blockcount: 20512
> Fragment: Address: 0 Number: 0 Size: 0
> ctime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
> atime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
> mtime: 0x52d6df0f -- Wed Jan 15 14:18:39 2014
> BLOCKS:
> (0-11):12288-12299, (IND):8787, (12-1035):12300-13323, (DIND):8790, (IND):8791,
> (1036-2059):13324-14347, (IND):8789, (2060-2559):14348-14847
> TOTAL: 2564
>
> debugfs:
>
> To give folks an idea about how significant an impact on performance this
> is, using ext4 to mount my ext3 filesystem and create files is resulting
> in a 10-15% reduction in speed when data is being read back into memory.
> I also tested 3.11.7 and see the same poor allocation layout. I also
> tried turning off delalloc, but there was no change in the layout of the
> data blocks. Has anyone got any ideas what's going on here? Cheers,
>
> -ben
> --
> "Thought is the essence of where you are now."
* Re: ext4: indirect block allocations not sequential in 3.4.67 and 3.11.7
  2014-01-15 20:22 ` Darrick J. Wong
@ 2014-01-15 20:32   ` Benjamin LaHaise
  2014-01-15 21:56     ` Benjamin LaHaise
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin LaHaise @ 2014-01-15 20:32 UTC
To: Darrick J. Wong; +Cc: linux-ext4

On Wed, Jan 15, 2014 at 12:22:14PM -0800, Darrick J. Wong wrote:
> A dumpe2fs would be nice, but I think I have enough here to speculate:

It's trivial to reproduce. Just create a 1GB file, run mkfs.ext3, then
mount with ext4 and dd a 10MB file onto the filesystem.

> The data blocks are all sequential, which looks like what one would expect from
> mballoc. Is your complaint that the *IND blocks are not inline with the
> data blocks, like what ext3 did?

The problem is that the indirect blocks are nowhere near where the file's
data is. It'd be perfectly okay if they were at the beginning of the range
of blocks used for the file's data.

> FWIW, ext3 did something like this:
> (0-11):6144-6155, (IND):6156, (12-1035):6157-7180, (DIND):7181, (IND):7182,
> (1036-2059):7183-8206, (IND):8207, (2060-2559):8208-8707
>
> I think the behavior that you're seeing is ext4 trying to keep the mapping
> blocks close to the inode table to avoid fragmenting the file -- see
> ext4_find_near() in indirect.c. There's an XXX comment in ext4_find_goal()
> that implies that someone might have wanted to tie in with mballoc, which I
> suppose you could use to restore the ext3 behavior... but there's no way to do
> that.
...

I tried a few tests setting goal to different things, but evidently I'm
not managing to convince mballoc to put the file's data close to my goal
block, something in that mess of complicated logic is making it ignore
the goal value I'm passing in.

                -ben
--
"Thought is the essence of where you are now."
* Re: ext4: indirect block allocations not sequential in 3.4.67 and 3.11.7
  2014-01-15 20:32   ` Benjamin LaHaise
@ 2014-01-15 21:56     ` Benjamin LaHaise
  2014-01-16  3:54       ` Theodore Ts'o
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin LaHaise @ 2014-01-15 21:56 UTC
To: Darrick J. Wong; +Cc: linux-ext4

On Wed, Jan 15, 2014 at 03:32:05PM -0500, Benjamin LaHaise wrote:
> I tried a few tests setting goal to different things, but evidently I'm
> not managing to convince mballoc to put the file's data close to my goal
> block, something in that mess of complicated logic is making it ignore
> the goal value I'm passing in.

It appears that ext4_new_meta_blocks() essentially ignores the goal block
specified for metadata blocks. If I hack around things and pass in the
EXT4_MB_HINT_TRY_GOAL flag where ext4_new_meta_blocks() is called in
ext4_alloc_blocks(), then it will at least try to allocate the block
specified by goal. However, if the block specified by goal is not free,
it ends up allocating blocks many megabytes away, even if one is free
within a few blocks of goal.

                -ben
--
"Thought is the essence of where you are now."
* Re: ext4: indirect block allocations not sequential in 3.4.67 and 3.11.7
  2014-01-15 21:56     ` Benjamin LaHaise
@ 2014-01-16  3:54       ` Theodore Ts'o
  2014-01-16 18:48         ` Benjamin LaHaise
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2014-01-16 3:54 UTC
To: Benjamin LaHaise; +Cc: Darrick J. Wong, linux-ext4

On Wed, Jan 15, 2014 at 04:56:13PM -0500, Benjamin LaHaise wrote:
> On Wed, Jan 15, 2014 at 03:32:05PM -0500, Benjamin LaHaise wrote:
> > I tried a few tests setting goal to different things, but evidently I'm
> > not managing to convince mballoc to put the file's data close to my goal
> > block, something in that mess of complicated logic is making it ignore
> > the goal value I'm passing in.
>
> It appears that ext4_new_meta_blocks() essentially ignores the goal block
> specified for metadata blocks. If I hack around things and pass in the
> EXT4_MB_HINT_TRY_GOAL flag where ext4_new_meta_blocks() is called in
> ext4_alloc_blocks(), then it will at least try to allocate the block
> specified by goal. However, if the block specified by goal is not free,
> it ends up allocating blocks many megabytes away, even if one is free
> within a few blocks of goal.

I don't remember who sent in the patch to make this change, but the
goal of this change (which was deliberate) was to speed up operations
such as deletes, since the indirect blocks would be (ideally) close
together. If I recall correctly, the person who made this change was
more concerned about random read/write workloads than sequential
workloads. He or she did make the assertion that in general the
triple indirect and double indirect blocks would tend to be flushed
out of memory anyway.

Looking back, I'm not sure how strong that particular argument really
was, but I don't think we really spent a lot of time focusing on that
argument, given that extents were what was going to give the very
clear win.

Something that might be worth experimenting with is extending the
EXT4_IOC_PRECACHE_EXTENTS to support indirect block mapped files. If
we have managed to keep all of the indirect blocks close together at
the beginning of the flex_bg, and if we have indeed succeeded in
keeping the data blocks contiguous on disk, then sucking in all of the
indirect blocks and distilling it into a few extent status cache
entries might be the best way to accelerate performance.

If we can keep the data blocks for the multi-gigabyte file completely
contiguous on disk, then all of the indirect blocks (or extent tree)
can be stored in memory in a single 40 byte data structure. (Of
course, with a legacy ext3 file system layout, every 128 megs or so the
data blocks will be broken up by the block group metadata --- this is
one of the reasons why we implemented the flex_bg feature in ext4, to
relax the requirement that the inode table and allocation bitmaps for
a block group have to be stored in the block group. Still, using 320
bytes of memory for each 1G file is not too shabby.)

That way, we get the best of both worlds; because the indirect blocks
are close to each other (instead of being inline with the data blocks)
things like deleting the file will be fast. But so will precaching
all of the logical->physical block data, since we can read all of the
indirect blocks in at once, and then store it in memory in a highly
compacted form in the extents status cache.

Regards,

                                        - Ted
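
The 320-byte figure in the message above follows from simple arithmetic:
one cache entry per contiguous run of data blocks, with a run ending at
each block group boundary on a legacy layout. A small sketch, taking the
40-byte entry size and the 128 MB run length straight from the message;
cache_bytes() is a made-up helper for illustration, not a kernel function.

```python
ENTRY_BYTES = 40           # per-entry size quoted in the message
GROUP_BYTES = 128 * 2**20  # "128 megs or so" between block group metadata


def cache_bytes(file_bytes, contiguous_run_bytes):
    """Memory needed to describe a file as one cache entry per contiguous run."""
    runs = -(-file_bytes // contiguous_run_bytes)  # ceiling division
    return runs * ENTRY_BYTES


if __name__ == "__main__":
    one_gib = 2**30
    # Legacy ext3 layout: data broken up every 128 MiB, so 8 runs per 1 GiB.
    print("legacy layout:", cache_bytes(one_gib, GROUP_BYTES), "bytes")
    # Fully contiguous file: a single entry.
    print("contiguous:   ", cache_bytes(one_gib, one_gib), "bytes")
```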
* Re: ext4: indirect block allocations not sequential in 3.4.67 and 3.11.7
  2014-01-16  3:54       ` Theodore Ts'o
@ 2014-01-16 18:48         ` Benjamin LaHaise
  2014-01-16 19:12           ` Theodore Ts'o
  0 siblings, 1 reply; 9+ messages in thread
From: Benjamin LaHaise @ 2014-01-16 18:48 UTC
To: Theodore Ts'o; +Cc: Darrick J. Wong, linux-ext4

Hi Ted,

On Wed, Jan 15, 2014 at 10:54:59PM -0500, Theodore Ts'o wrote:
> On Wed, Jan 15, 2014 at 04:56:13PM -0500, Benjamin LaHaise wrote:
> > On Wed, Jan 15, 2014 at 03:32:05PM -0500, Benjamin LaHaise wrote:
> > > I tried a few tests setting goal to different things, but evidently I'm
> > > not managing to convince mballoc to put the file's data close to my goal
> > > block, something in that mess of complicated logic is making it ignore
> > > the goal value I'm passing in.
> >
> > It appears that ext4_new_meta_blocks() essentially ignores the goal block
> > specified for metadata blocks. If I hack around things and pass in the
> > EXT4_MB_HINT_TRY_GOAL flag where ext4_new_meta_blocks() is called in
> > ext4_alloc_blocks(), then it will at least try to allocate the block
> > specified by goal. However, if the block specified by goal is not free,
> > it ends up allocating blocks many megabytes away, even if one is free
> > within a few blocks of goal.
>
> I don't remember who sent in the patch to make this change, but the
> goal of this change (which was deliberate) was to speed up operations
> such as deletes, since the indirect blocks would be (ideally) close
> together. If I recall correctly, the person who made this change was
> more concerned about random read/write workloads than sequential
> workloads. He or she did make the assertion that in general the
> triple indirect and double indirect blocks would tend to be flushed
> out of memory anyway.

Any idea when this commit was made or what it was titled? I care about
random performance as well, but that can't be at the cost of making
sequential reads suck.

> Looking back, I'm not sure how strong that particular argument really
> was, but I don't think we really spent a lot of time focusing on that
> argument, given that extents were what was going to give the very
> clear win.
>
> Something that might be worth experimenting with is extending the
> EXT4_IOC_PRECACHE_EXTENTS to support indirect block mapped files. If
> we have managed to keep all of the indirect blocks close together at
> the beginning of the flex_bg, and if we have indeed succeeded in
> keeping the data blocks contiguous on disk, then sucking in all of the
> indirect blocks and distilling it into a few extent status cache
> entries might be the best way to accelerate performance.

The seek to get to the indirect blocks is still a cost that is not present
in ext3, meaning that the bar is pretty high to avoid a regression.

> If we can keep the data blocks for the multi-gigabyte file completely
> contiguous on disk, then all of the indirect blocks (or extent tree)
> can be stored in memory in a single 40 byte data structure. (Of
> course, with a legacy ext3 file system layout, every 128 megs or so the
> data blocks will be broken up by the block group metadata --- this is
> one of the reasons why we implemented the flex_bg feature in ext4, to
> relax the requirement that the inode table and allocation bitmaps for
> a block group have to be stored in the block group. Still, using 320
> bytes of memory for each 1G file is not too shabby.)

The files I'm dealing with are usually 8MB in size, and there can be up
to 1 million of them. In such a use-case, I don't expect the inodes will
always remain cached in memory (some of the systems involved only have
4GB of RAM), so adding another metadata cache won't fix the regression.
The crux of the issue is that the indirect blocks are getting placed many
*megabytes* away from the data blocks. Incurring a seek for every 4MB
of data read seems pretty painful. Putting the metadata closer to the
data seems like the right thing to do. And it should help the random
i/o case as well.

                -ben

> That way, we get the best of both worlds; because the indirect blocks
> are close to each other (instead of being inline with the data blocks)
> things like deleting the file will be fast. But so will precaching
> all of the logical->physical block data, since we can read all of the
> indirect blocks in at once, and then store it in memory in a highly
> compacted form in the extents status cache.
>
> Regards,
>
> - Ted

--
"Thought is the essence of where you are now."
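
The "seek for every 4MB" figure comes from the classic indirect mapping
scheme: 12 direct pointers in the inode, then indirect blocks that each hold
1024 four-byte pointers when the block size is 4096 bytes. The sketch below
counts the mapping blocks a file of a given size needs. It assumes 4 KiB
blocks and a fully written (non-sparse) file, and mapping_blocks() is a
made-up helper for illustration, not kernel code.

```python
BLOCK_SIZE = 4096       # assumption: 4 KiB blocks, as in the debugfs output
DIRECT = 12             # direct block pointers held in the inode
PTRS = BLOCK_SIZE // 4  # 4-byte block pointers per indirect block (1024)


def mapping_blocks(file_bytes):
    """Return (single-indirect, double-indirect) block counts for a dense file."""
    data = -(-file_bytes // BLOCK_SIZE)  # data blocks, rounded up
    ind = dind = 0
    remaining = max(0, data - DIRECT)
    if remaining:
        ind += 1                         # the inode's single-indirect block
        remaining = max(0, remaining - PTRS)
    if remaining:
        if remaining > PTRS * PTRS:
            raise ValueError("triple-indirect files are not handled in this sketch")
        dind = 1                         # the inode's double-indirect block
        ind += -(-remaining // PTRS)     # indirect blocks hanging off it
    return ind, dind


if __name__ == "__main__":
    print("data mapped per indirect block:", PTRS * BLOCK_SIZE // 2**20, "MiB")
    for mib in (8, 10):
        ind, dind = mapping_blocks(mib * 2**20)
        print(f"{mib} MiB file: {ind} IND + {dind} DIND")
```

For the 10MB test files this gives 3 IND plus 1 DIND, which agrees with the
debugfs listing earlier in the thread (TOTAL 2564 = 2560 data blocks + 4
mapping blocks).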
* Re: ext4: indirect block allocations not sequential in 3.4.67 and 3.11.7
  2014-01-16 18:48         ` Benjamin LaHaise
@ 2014-01-16 19:12           ` Theodore Ts'o
  2014-01-16 19:30             ` Benjamin LaHaise
  2014-01-20 20:52             ` Eric Sandeen
  0 siblings, 2 replies; 9+ messages in thread
From: Theodore Ts'o @ 2014-01-16 19:12 UTC
To: Benjamin LaHaise; +Cc: Darrick J. Wong, linux-ext4

On Thu, Jan 16, 2014 at 01:48:26PM -0500, Benjamin LaHaise wrote:
>
> Any idea when this commit was made or what it was titled? I care about
> random performance as well, but that can't be at the cost of making
> sequential reads suck.

Thinking about this some more, I think it was made as part of the
changes to better take advantage of the flex_bg feature in ext4. The
idea was to keep metadata blocks such as directory blocks and extent
trees closer together. I don't think when we made that change we
really consciously thought that much about indirect block support,
since that was viewed as a legacy feature for backwards compatibility
support in ext4. (This was years ago, before distributions started
wanting to support only one code base for ext3 and ext4 file systems.)

I *know* we've had this discussion about whether to put the indirect
blocks inline with the data, or closer together to speed up metadata
operations (i.e., unlink, fsck, etc.) before, though. There was a
patch against ext3 I remember looking at which forced the indirect
blocks to the end of the previous block group. That kept the indirect
blocks closer together, and on average 64MB away from the data blocks.
As I recall, the stated reason for the patch was to make unlinks of
backups of DVD images not take forever and a day.

I'm pretty sure we've had it at least once on the weekly ext4
concalls, and I'm pretty sure we've had it in one hallway track or
another. Ultimately, extents are such a huge win that it's not clear
it's really worth that much effort to try to optimize indirect blocks,
which are a lose no matter how you slice and dice things.

> The files I'm dealing with are usually 8MB in size, and there can be up
> to 1 million of them. In such a use-case, I don't expect the inodes will
> always remain cached in memory (some of the systems involved only have
> 4GB of RAM), so adding another metadata cache won't fix the regression.
> The crux of the issue is that the indirect blocks are getting placed many
> *megabytes* away from the data blocks. Incurring a seek for every 4MB
> of data read seems pretty painful. Putting the metadata closer to the
> data seems like the right thing to do. And it should help the random
> i/o case as well.

An 8MB file will require two indirect blocks. If you are using
extents, almost certainly it will fit inside the inode, which means we
don't need any external metadata blocks. That massively speeds up
fsck time, and unlink time, and it also speeds up the random read case
since the best way to optimize a seek is to eliminate it. :-)

I understand that for your use case, it would be hard to move to using
extents right away. But I think you'd see so many improvements from
going to ext4 and extents that it might be more efficient than trying
to optimize an indirect block scheme.

                                        - Ted
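
The claim that an 8MB file "will fit inside the inode" with extents can be
checked against the ext4 on-disk format: the 60-byte i_block area holds a
12-byte extent header followed by up to four 12-byte extent entries, and a
single initialized extent covers at most 32768 blocks. The sketch below
assumes 4 KiB blocks and takes the file's contiguous runs as input;
extents_needed() and fits_in_inode() are made-up helpers for illustration,
not kernel functions.

```python
BLOCK_SIZE = 4096          # assumption: 4 KiB blocks
I_BLOCK_BYTES = 60         # size of the inode's i_block area
HEADER_BYTES = 12          # extent header at the start of i_block
EXTENT_BYTES = 12          # one extent entry
MAX_EXTENT_BLOCKS = 32768  # longest initialized extent, in blocks

# Extent entries that fit in the inode after the header: (60 - 12) // 12 = 4
IN_INODE_EXTENTS = (I_BLOCK_BYTES - HEADER_BYTES) // EXTENT_BYTES


def extents_needed(run_lengths):
    """Extents required to map a file given its contiguous runs (in blocks)."""
    return sum(-(-n // MAX_EXTENT_BLOCKS) for n in run_lengths)


def fits_in_inode(run_lengths):
    """True if the mapping needs no metadata blocks outside the inode."""
    return extents_needed(run_lengths) <= IN_INODE_EXTENTS


if __name__ == "__main__":
    eight_mb = 8 * 2**20 // BLOCK_SIZE  # 2048 blocks
    print("contiguous 8 MiB file:", fits_in_inode([eight_mb]))
    print("8 MiB file in 5 fragments:", fits_in_inode([410, 410, 410, 410, 408]))
```

A contiguous 8MB file needs a single extent, so no external mapping block
has to be read at all; only a file split into more than four runs would
spill into an external extent tree block.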
* Re: ext4: indirect block allocations not sequential in 3.4.67 and 3.11.7
  2014-01-16 19:12           ` Theodore Ts'o
@ 2014-01-16 19:30             ` Benjamin LaHaise
  2014-01-20 20:52             ` Eric Sandeen
  1 sibling, 0 replies; 9+ messages in thread
From: Benjamin LaHaise @ 2014-01-16 19:30 UTC
To: Theodore Ts'o; +Cc: Darrick J. Wong, linux-ext4

On Thu, Jan 16, 2014 at 02:12:27PM -0500, Theodore Ts'o wrote:
> An 8MB file will require two indirect blocks. If you are using
> extents, almost certainly it will fit inside the inode, which means we
> don't need any external metadata blocks. That massively speeds up
> fsck time, and unlink time, and it also speeds up the random read case
> since the best way to optimize a seek is to eliminate it. :-)
>
> I understand that for your use case, it would be hard to move to using
> extents right away. But I think you'd see so many improvements from
> going to ext4 and extents that it might be more efficient than trying
> to optimize an indirect block scheme.

Unfortunately, the improvements from extents for our use-case are not
enough to outweigh the other costs of deployment. I think I've figured
out a hack that results in the system doing most of what I want it to do:
I've removed EXT4_MB_HINT_DATA in ext4_alloc_blocks(). With that change,
the allocator is giving me mostly sequential allocations. Hopefully that
doesn't have any other negative side effects.

                -ben

> - Ted

--
"Thought is the essence of where you are now."
* Re: ext4: indirect block allocations not sequential in 3.4.67 and 3.11.7
  2014-01-16 19:12           ` Theodore Ts'o
  2014-01-16 19:30             ` Benjamin LaHaise
@ 2014-01-20 20:52             ` Eric Sandeen
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2014-01-20 20:52 UTC
To: Theodore Ts'o, Benjamin LaHaise; +Cc: Darrick J. Wong, linux-ext4

On 1/16/14, 1:12 PM, Theodore Ts'o wrote:
> On Thu, Jan 16, 2014 at 01:48:26PM -0500, Benjamin LaHaise wrote:
>>
>> Any idea when this commit was made or what it was titled? I care about
>> random performance as well, but that can't be at the cost of making
>> sequential reads suck.
>
> Thinking about this some more, I think it was made as part of the
> changes to better take advantage of the flex_bg feature in ext4. The
> idea was to keep metadata blocks such as directory blocks and extent
> trees closer together. I don't think when we made that change we
> really consciously thought that much about indirect block support,
> since that was viewed as a legacy feature for backwards compatibility
> support in ext4. (This was years ago, before distributions started
> wanting to support only one code base for ext3 and ext4 file systems.)

Just to nitpick, wasn't this always the plan? ;)

https://lkml.org/lkml/2006/6/28/454 :

> 4) At some point, probably in 6-9 months when we are satisfied with the
> set of features that have been added to fs/ext4, and confident that the
> filesystem format has stabilized, we will submit a patch which causes the
> fs/ext4 code to register itself as the ext4 filesystem.

-Eric

p.s. "6-9 months" ;)