* Reviewing ext3 improvement patches (delalloc, mballoc, extents) @ 2005-03-03 8:33 Suparna Bhattacharya 2005-03-03 9:40 ` Andreas Dilger 2005-03-04 1:12 ` [Ext2-devel] " Badari Pulavarty 0 siblings, 2 replies; 24+ messages in thread From: Suparna Bhattacharya @ 2005-03-03 8:33 UTC (permalink / raw) To: ext2-devel, linux-fsdevel; +Cc: Alex Tomas Since the performance improvements seen so far are quite encouraging, and momentum is picking up so well, I started looking through the patches from Alex ... just a quick code walkthrough to get a hang of it and think about what kind of simplifications might be possible and what it might take for inclusion. I haven't had a chance to go too deep line by line yet, but thought I'd initiate some discussion with some first impressions and summary of what directions I hear several people converging towards to validate if I'm on the right track here. diffstat of the 3 patches : 22 files changed, 5920 insertions(+), 47 deletions. The largest is in the extents patch (2743), mballoc is 1968, and delalloc is 1209. To use delalloc, which gives us all the performance benefits, right now we need all the 3 patches to be used in conjunction. Supporting extent map btrees as well as traditional indexing and associated options for compatibility etc is perhaps the more invasive of changes. Given that keeping ext3 stable and maintainable is a key concern (that is after all a major reason why a lot of users rely on ext3), a somewhat incremental approach is desirable. So, I'll start from the direction that has been suggested by some -- (1) delayed allocation without changing the on-disk format. 
And then later (2) go on to breaking format with all changes for scalability to larger files with full extents support (haven't thought enough about this yet - maybe in a separate mail) A few random things that come to mind for (1), going through the code: - There might be possibilities for code reduction, by extending generic routines as far as possible, e.g. ext3_wb_writepages has a lot in common with generic writepages. That would also make it easier to maintain. - Similarly, how about (as Mingming I think already hinted) implementing ext3_get_blocks to do multi-block lookup and allocation and using it in delalloc ? Hmm, maybe I speak too soon - have to look at the interfaces more closely and verify if this is feasible. Regards Suparna -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents) 2005-03-03 8:33 Reviewing ext3 improvement patches (delalloc, mballoc, extents) Suparna Bhattacharya @ 2005-03-03 9:40 ` Andreas Dilger 2005-03-03 22:10 ` Theodore Ts'o 2005-03-04 11:13 ` Suparna Bhattacharya 2005-03-04 1:12 ` [Ext2-devel] " Badari Pulavarty 1 sibling, 2 replies; 24+ messages in thread From: Andreas Dilger @ 2005-03-03 9:40 UTC (permalink / raw) To: Suparna Bhattacharya; +Cc: ext2-devel, linux-fsdevel, Alex Tomas [-- Attachment #1: Type: text/plain, Size: 2380 bytes --] On Mar 03, 2005 14:03 +0530, Suparna Bhattacharya wrote: > diffstat of the 3 patches : 22 files changed, 5920 insertions(+), > 47 deletions. The largest is in the extents patch (2743), mballoc > is 1968, and delalloc is 1209. To use delalloc, which gives us > all the performance benefits, right now we need all the 3 patches > to be used in conjunction. Supporting extent map btrees as well > as traditional indexing and associated options for compatibility etc > is perhaps the more invasive of changes. Given that keeping ext3 > stable and maintainable is a key concern (that is after all a major > reason why a lot of users rely on ext3), a somewhat incremental > approach is desirable. > > So, I'll start from the direction that has been suggested by > some -- (1) delayed allocation without changing the > on-disk format. And then later (2) go on to breaking format with > all changes for scalability to larger files with full extents > support (haven't thought enough about this yet - maybe in a > separate mail) Well, for a starter, the extents format changes are not forced on users, only if they mount with "-o extents" and write files will it mark the superblock incompatible and start allocating files this way. I believe (though I have never tested) that even if extents are enabled, writes to a block-mapped file will continue to work and that file will not be converted to an extent file. 
> A few random things that come to mind for (1), going through the code: > > - There might be possibilities for code reduction, by extending > generic routines as far as possible, e.g. ext3_wb_writepages > has a lot in common with generic writepages. That would > also make it easier to maintain. I'm sure some support for this could be gotten from e.g. XFS as well, since their filesystem (on Irix at least) was all about delayed alloc (not sure what it does under Linux), and I believe ReiserFS/Reiser4 also desire the ability to have delayed allocation from the VFS (i.e. some sort of light-weight "reserve space" call for each page dirtied and then getting the actual file + offsets en masse later (if the VFS/VM doesn't discard the whole thing). Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/ [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents) 2005-03-03 9:40 ` Andreas Dilger @ 2005-03-03 22:10 ` Theodore Ts'o 2005-03-03 22:30 ` Alex Tomas 2005-03-04 11:13 ` Suparna Bhattacharya 1 sibling, 1 reply; 24+ messages in thread From: Theodore Ts'o @ 2005-03-03 22:10 UTC (permalink / raw) To: Suparna Bhattacharya, ext2-devel, linux-fsdevel, Alex Tomas On Thu, Mar 03, 2005 at 02:40:21AM -0700, Andreas Dilger wrote: > > Well, for a starter, the extents format changes are not forced on > users, only if they mount with "-o extents" and write files will > it mark the superblock incompatible and start allocating files > this way. I believe (though I have never tested) that even if > extents are enabled, writes to a block-mapped file will continue > to work and that file will not be converted to an extent file. > I was about to start a new thread discussing this, but you started commenting about it here, so I'll address it here. The way most of the other ext3 extensions that involve feature changes work is that you enable them by using tune2fs (for example, "tune2fs -O dir_index /dev/hdaXX"), instead of using a mount option which then causes the kernel to automatically flip on a feature incompatibility flag. It would be good if the extents were changed to follow this convention for the following reasons: 1) Consistency is less confusing for users, and features like dir_index and has_journal work this way already. 2) Long term, you don't want users to have to specify -o extents as a mount option, or specify "extents" in /etc/fstab in order to make the filesystem work correctly. From an initial examination, the extents code won't be initialized properly if you don't specify -o extents. So I'm pretty sure that if you try to mount a filesystem that has extents, but forget to specify -o extents, things will break in an entertaining fashion. I haven't tried it yet, though. 
3) It means that users who are fooling around with the patch will have at least some kind of patched userspace first (or understand how to use debugfs to set the feature manually). This makes it less likely that they will apply the patch, or get it applied for free when they use the -mm tree (once the patches get accepted by Andrew), and then when they specify -o extents, all of a sudden the filesystem becomes incompatible and is no longer mountable using standard tools. This is a bit of a pet peeve of mine, since I am still getting personal e-mail from people who were using Red Hat 7, and decided to just "try out" Fedora Core 3, only to find that any filesystems mounted by FC3 could no longer be mounted on their RH7 or RH8 systems. That's why I prefer users to have to do something that obviously acknowledges that they are making a change to their filesystem's compatibility prospects --- and that's something which is a lot more obvious with a "tune2fs -O " command, as compared to using a magic mount option. - Ted ^ permalink raw reply [flat|nested] 24+ messages in thread
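To illustrate the compatibility mechanism this argument relies on: ext2/ext3 keep a bitmask of incompatible features in the superblock, and a kernel that finds a bit it does not understand refuses the mount. The following is only a minimal userspace sketch of that check. The struct, the function names and the specific bit values are assumptions made for illustration; this is not code from the patches or from e2fsprogs.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values modelled on the ext2/ext3 incompat feature bits.
 * The EXTENTS bit in particular is an assumption of this sketch. */
#define FEATURE_INCOMPAT_FILETYPE 0x0002u
#define FEATURE_INCOMPAT_RECOVER  0x0004u
#define FEATURE_INCOMPAT_EXTENTS  0x0040u

/* Hypothetical in-memory superblock: only the field this sketch needs. */
struct sketch_super {
    uint32_t s_feature_incompat;
};

/* Returns 1 if a kernel that understands the bits in `supported` may
 * mount the filesystem, 0 if it has to refuse. */
int sketch_can_mount(const struct sketch_super *sb, uint32_t supported)
{
    return (sb->s_feature_incompat & ~supported) == 0;
}

/* Models an explicit, offline feature flip, which is what a
 * "tune2fs -O <feature>" style command amounts to in this toy model. */
void sketch_enable_feature(struct sketch_super *sb, uint32_t feature)
{
    sb->s_feature_incompat |= feature;
}
```

In this model, once the extents bit is set, a kernel whose supported mask lacks that bit can no longer mount the filesystem, whichever way the bit got set. The difference between the two approaches is only who flips the bit and how visible that step is to the user.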
* Re: Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents) 2005-03-03 22:10 ` Theodore Ts'o @ 2005-03-03 22:30 ` Alex Tomas 0 siblings, 0 replies; 24+ messages in thread From: Alex Tomas @ 2005-03-03 22:30 UTC (permalink / raw) To: Theodore Ts'o; +Cc: suparna, ext2-devel, linux-fsdevel, alex On Thu, 3 Mar 2005 17:10:10 -0500 "Theodore Ts'o" <tytso@mit.edu> wrote: > > This is a bit of a pet peeve of mine, since I am still getting > personal e-mail from people who were using Red Hat 7, and decided to > just "try out" Fedora Core 3, only to find that any filesystems > mounted by FC3 could no longer be mounted on their RH7 or RH8 > systems. That's why I prefer users to have to do something that > obviously acknowledges that they are making a change to their > filesystem's compatibility prospects --- and that's something which is > a lot more obvious with a "tune2fs -O " command, as compared to using > a magic mount option. > Makes sense to me. I'd just like to note that the mount option was chosen for one reason only: it's simple to use during debugging. thanks, Alex ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents) 2005-03-03 9:40 ` Andreas Dilger 2005-03-03 22:10 ` Theodore Ts'o @ 2005-03-04 11:13 ` Suparna Bhattacharya 2005-03-04 12:29 ` Alex Tomas 1 sibling, 1 reply; 24+ messages in thread From: Suparna Bhattacharya @ 2005-03-04 11:13 UTC (permalink / raw) To: ext2-devel, linux-fsdevel, Alex Tomas On Thu, Mar 03, 2005 at 02:40:21AM -0700, Andreas Dilger wrote: > On Mar 03, 2005 14:03 +0530, Suparna Bhattacharya wrote: > > diffstat of the 3 patches : 22 files changed, 5920 insertions(+), > > 47 deletions. The largest is in the extents patch (2743), mballoc > > is 1968, and delalloc is 1209. To use delalloc, which gives us > > all the performance benefits, right now we need all the 3 patches > > to be used in conjunction. Supporting extent map btrees as well > > as traditional indexing and associated options for compatibility etc > > is perhaps the more invasive of changes. Given that keeping ext3 > > stable and maintainable is a key concern (that is after all a major > > reason why a lot of users rely on ext3), a somewhat incremental > > approach is desirable. > > > > So, I'll start from the direction that has been suggested by > > some -- (1) delayed allocation without changing the > > on-disk format. And then later (2) go on to breaking format with > > all changes for scalability to larger files with full extents > > support (haven't thought enough about this yet - maybe in a > > separate mail) > > Well, for a starter, the extents format changes are not forced on > users, only if they mount with "-o extents" and write files will > it mark the superblock incompatible and start allocating files > this way. I believe (though I have never tested) that even if > extents are enabled, writes to a block-mapped file will continue > to work and that file will not be converted to an extent file. 
Files that are created with extents will not be viewable by an older kernel, though (I think) - which is where the format breakage comes in (is that correct ?). But I don't see this as a major issue, since it can perhaps be taken care of through a little bit of migration tooling as Ted indicated. So, compatibility in itself wasn't the main concern bothering me but how we could make it easier to assure stability & maintainability even with all the cool stuff. For example, if we have both mballoc and regular balloc and similarly extents and regular indexing based on growth patterns (a nice idea, btw), does it multiply the scenarios to verify on the testing front ? Or in dealing with changes in the future ? I'm guessing that this might be one of the things (besides agreement on the disk layout) holding up inclusion of extents, despite the patches being around for a while now .. but then I could be wrong. B-tree based extent maps were mentioned by sct way back in his 2000 paper ! And of course every filesystem out there implements B-trees in its own way. I can see arguments flying both ways ... at what point do we decide to break towards an ext4 ? BTW, has anyone tried playing with the idea of ext4 as not a cp -r fs/ext3 fs/ext4 and edit, but if possible using some layered filesystem techniques to reuse much of ext3 directly, and just override a few operations (like get_blocks for extents etc) where there is a layout impact ? Alex, have you had a chance to prototype your idea of rooting extents in ea ? > > > A few random things that come to mind for (1), going through the code: > > > > - There might be possibilities for code reduction, by extending > > generic routines as far as possible, e.g. ext3_wb_writepages > > has a lot in common with generic writepages. That would > > also make it easier to maintain. > > I'm sure some support for this could be gotten from e.g. 
XFS as well, > since their filesystem (on Irix at least) was all about delayed alloc > (not sure what it does under Linux), and I believe ReiserFS/Reiser4 > also desire the ability to have delayed allocation from the VFS (i.e. > some sort of light-weight "reserve space" call for each page dirtied > and then getting the actual file + offsets en masse later (if the > VFS/VM doesn't discard the whole thing). *nod* Regards Suparna > > Cheers, Andreas > -- > Andreas Dilger > http://sourceforge.net/projects/ext2resize/ > http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/ > -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents) 2005-03-04 11:13 ` Suparna Bhattacharya @ 2005-03-04 12:29 ` Alex Tomas 2005-03-04 18:25 ` [Ext2-devel] " Andreas Dilger 0 siblings, 1 reply; 24+ messages in thread From: Alex Tomas @ 2005-03-04 12:29 UTC (permalink / raw) To: suparna; +Cc: ext2-devel, linux-fsdevel, alex On Fri, 4 Mar 2005 16:43:31 +0530 Suparna Bhattacharya <suparna@in.ibm.com> wrote: > Alex, have you had a chance to prototype your idea of rooting extents > in ea ? I think all you need for this is: 1) allocate EA in ext3_new_inode() 2) write a replacement for ext3_init_tree_desc(), just a few lines of code 3) write .get_write_access and .mark_buffer_dirty methods, again a few lines 4) use the replacement for ext3_init_tree_desc() in a few places thanks, Alex ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Ext2-devel] Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents) 2005-03-04 12:29 ` Alex Tomas @ 2005-03-04 18:25 ` Andreas Dilger 0 siblings, 0 replies; 24+ messages in thread From: Andreas Dilger @ 2005-03-04 18:25 UTC (permalink / raw) To: Alex Tomas; +Cc: suparna, ext2-devel, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 1122 bytes --] On Mar 04, 2005 15:29 +0300, Alex Tomas wrote: > On Fri, 4 Mar 2005 16:43:31 +0530 > Suparna Bhattacharya <suparna@in.ibm.com> wrote: > > > Alex, have you had a chance to prototype your idea of rooting extents > > in ea ? > > I think all you need for this is: > > 1) allocate EA in ext3_new_inode() > 2) write a replacement for ext3_init_tree_desc(), > just a few lines of code > 3) write .get_write_access and .mark_buffer_dirty methods, > again a few lines > 4) use the replacement for ext3_init_tree_desc() in a few places This should of course only be done for large inodes. Also, at some point it will consume all of the EA space and we need to use an external block. It might help in some middle cases (i.e. files with more extents than can fit in i_block (60 bytes), but fewer than fit into the large inode space (128 or maybe 384 bytes)) but it might also hurt other things if we need to allocate an EA block for another EA... Cheers, Andreas -- Andreas Dilger http://sourceforge.net/projects/ext2resize/ http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/ [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 24+ messages in thread
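To put rough numbers on the "middle cases" mentioned above, here is a small sketch that computes how many extent records fit into an area of a given size. It assumes a 12-byte header followed by 12-byte extent records; the struct layouts and names below are assumptions for illustration rather than the definitions from the extents patch, and the sketch ignores any per-attribute overhead an EA would add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: a 12-byte header at the start of the area... */
struct sketch_extent_header {
    uint16_t magic;
    uint16_t entries;     /* records currently in use */
    uint16_t max;         /* records the area can hold */
    uint16_t depth;       /* 0 means the records are leaves */
    uint32_t generation;
};

/* ...followed by 12-byte records, each mapping a run of logical
 * blocks to a run of physical blocks. */
struct sketch_extent {
    uint32_t first_logical;
    uint16_t len;
    uint16_t start_hi;
    uint32_t start_lo;
};

/* Number of extent records that fit in `area_bytes` after the header. */
size_t sketch_extents_that_fit(size_t area_bytes)
{
    if (area_bytes < sizeof(struct sketch_extent_header))
        return 0;
    return (area_bytes - sizeof(struct sketch_extent_header))
           / sizeof(struct sketch_extent);
}
```

Under these assumptions, the 60-byte block-pointer area in the inode holds 4 extents, a 128-byte area holds 9, and a 384-byte area holds 31, so the EA-rooted variant would help files in roughly the 5 to 31 extent range before an external block is needed.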
* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents) 2005-03-03 8:33 Reviewing ext3 improvement patches (delalloc, mballoc, extents) Suparna Bhattacharya 2005-03-03 9:40 ` Andreas Dilger @ 2005-03-04 1:12 ` Badari Pulavarty 2005-03-04 1:46 ` Mingming Cao ` (2 more replies) 1 sibling, 3 replies; 24+ messages in thread From: Badari Pulavarty @ 2005-03-04 1:12 UTC (permalink / raw) To: suparna; +Cc: ext2-devel, linux-fsdevel, Alex Tomas On Thu, 2005-03-03 at 00:33, Suparna Bhattacharya wrote: > Since the performance improvements seen so far are quite encouraging, > and momentum is picking up so well, I started looking through the > patches from Alex ... just a quick code walkthrough to get a hang > of it and think about what kind of simplifications might be possible > and what it might take for inclusion. > > I haven't had a chance to go too deep line by line yet, > but thought I'd initiate some discussion with some first impressions > and summary of what directions I hear several people converging > towards to validate if I'm on the right track here. > > diffstat of the 3 patches : 22 files changed, 5920 insertions(+), > 47 deletions. The largest is in the extents patch (2743), mballoc > is 1968, and delalloc is 1209. To use delalloc, which gives us > all the performance benefits, right now we need all the 3 patches > to be used in conjunction. Supporting extent map btrees as well > as traditional indexing and associated options for compatibility etc > is perhaps the more invasive of changes. Given that keeping ext3 > stable and maintainable is a key concern (that is after all a major > reason why a lot of users rely on ext3), a somewhat incremental > approach is desirable. > > So, I'll start from the direction that has been suggested by > some -- (1) delayed allocation without changing the > on-disk format. 
And then later (2) go on to breaking format with > all changes for scalability to larger files with full extents > support (haven't thought enough about this yet - maybe in a > separate mail) > Just doing delayed allocation without multiblock allocation (with the current layout) is not really a useful thing, IMHO. It will benefit a few cases, but in general we have only moved the block allocation overhead from prepare write to writepages/writepage time. There is a little benefit from not doing journaling twice etc., but I don't think it would be enough to justify the effort. Would it ? So, maybe we should look at adding multiblock allocation + delayed allocation to the current ext3 layout. Then we can evaluate the benefits of having "extents" etc and then break the layout ? One more thing we need to keep in mind: we need to make sure that "ordered" mode is also improved, since all our test code focuses on "writeback" mode and the default mode is "ordered" :( Thanks, Badari ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents) 2005-03-04 1:12 ` [Ext2-devel] " Badari Pulavarty @ 2005-03-04 1:46 ` Mingming Cao 2005-03-04 3:26 ` Suparna Bhattacharya 2005-03-14 8:36 ` Werner Almesberger 2005-03-04 11:30 ` [Ext2-devel] " Alex Tomas 2005-03-04 15:02 ` Alex Tomas 2 siblings, 2 replies; 24+ messages in thread From: Mingming Cao @ 2005-03-04 1:46 UTC (permalink / raw) To: Badari Pulavarty; +Cc: suparna, ext2-devel, linux-fsdevel, Alex Tomas On Thu, 2005-03-03 at 17:12 -0800, Badari Pulavarty wrote: > Just doing delayed allocation without multiblock allocation > (with the current layout) is not really a useful thing, IMHO. > It will benefit a few cases, but in general we have only moved the > block allocation overhead from prepare write to writepages/writepage > time. There is a little benefit from not doing journaling twice etc., > but I don't think it would be enough to justify the effort. > Would it ? > Hi Badari I agree delayed allocation makes much sense with multiblock allocation. But I still think it is worth the effort by itself, even without multiple block allocation. If we have a seeky random-write application, and the application later tries to fill those holes, we will normally end up with a pretty ugly file layout. With delayed allocation, we would have a better chance of getting contiguous blocks on disk for that file. I happened to find that Ted has mentioned this before: http://marc.theaimsgroup.com/?l=ext2-devel&m=107239591117758&w=2 > So, maybe we should look at adding multiblock allocation + > delayed allocation to the current ext3 layout. Then we can evaluate > the benefits of having "extents" etc and then break the layout ? > The current reservation code could be improved to report how big the free chunk inside the window is, and we could use that to help make ext3_new_blocks()/ext3_get_blocks() happen. ^ permalink raw reply [flat|nested] 24+ messages in thread
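The last suggestion, having the reservation code report how large the free chunk inside the window is, can be pictured with a small bitmap scan. This is only a userspace sketch under simple assumptions (one bit per block, a set bit meaning "in use", a window given as a half-open block range); the function names are made up for the example and are not the ext3 reservation interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed bitmap convention: bit set means the block is in use. */
int sketch_block_in_use(const unsigned char *bitmap, size_t block)
{
    return (bitmap[block / 8] >> (block % 8)) & 1;
}

/* Length of the run of free blocks beginning at `start`, never
 * extending past `window_end` (exclusive). A multi-block allocator
 * could hand out up to this many blocks in one call. */
size_t sketch_free_run(const unsigned char *bitmap, size_t start,
                       size_t window_end)
{
    size_t n = 0;

    while (start + n < window_end &&
           !sketch_block_in_use(bitmap, start + n))
        n++;
    return n;
}
```

With the run length known up front, a caller that has accumulated several dirty pages through delayed allocation could ask for the whole run at once instead of one block per call, which is the combination Badari argues for above.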
* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents) 2005-03-04 1:46 ` Mingming Cao @ 2005-03-04 3:26 ` Suparna Bhattacharya 2005-03-14 8:36 ` Werner Almesberger 1 sibling, 0 replies; 24+ messages in thread From: Suparna Bhattacharya @ 2005-03-04 3:26 UTC (permalink / raw) To: Mingming Cao Cc: Badari Pulavarty, ext2-devel, linux-fsdevel, Alex Tomas, akpm On Thu, Mar 03, 2005 at 05:46:13PM -0800, Mingming Cao wrote: > On Thu, 2005-03-03 at 17:12 -0800, Badari Pulavarty wrote: > > Just doing delayed allocation without multiblock allocation > > (with the current layout) is not really a useful thing, IMHO. > > It will benefit a few cases, but in general we have only moved the > > block allocation overhead from prepare write to writepages/writepage > > time. There is a little benefit from not doing journaling twice etc., > > but I don't think it would be enough to justify the effort. > > Would it ? > > > > Hi Badari > > I agree delayed allocation makes much sense with multiblock allocation. > But I still think it is worth the effort by itself, even without multiple > block allocation. If we have a seeky random-write application, and the > application later tries to fill those holes, we will normally end up with > a pretty ugly file layout. With delayed allocation, we would have a better > chance of getting contiguous blocks on disk for that file. > > I happened to find that Ted has mentioned this before: > http://marc.theaimsgroup.com/?l=ext2-devel&m=107239591117758&w=2 > > > So, maybe we should look at adding multiblock allocation + > > delayed allocation to the current ext3 layout. Then we can evaluate > > the benefits of having "extents" etc and then break the layout ? > > > > The current reservation code could be improved to report how big the > free chunk inside the window is, and we could use that to help make > ext3_new_blocks()/ext3_get_blocks() happen. Yup this is exactly what I was thinking. It'll probably only be a step along the way ... 
but I am hoping that this will give us a direction to merge these pieces in incrementally, a little at a time, each piece being very well-understood and with demonstrated performance improvements at every step. For example, the next step after the following could be to plug parts of mballoc in to the above, etc ... Does that make sense ? Regards Suparna -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents) 2005-03-04 1:46 ` Mingming Cao 2005-03-04 3:26 ` Suparna Bhattacharya @ 2005-03-14 8:36 ` Werner Almesberger 2005-03-14 9:04 ` Suparna Bhattacharya 1 sibling, 1 reply; 24+ messages in thread From: Werner Almesberger @ 2005-03-14 8:36 UTC (permalink / raw) To: Mingming Cao Cc: Badari Pulavarty, suparna, ext2-devel, linux-fsdevel, Alex Tomas, abiss-general Mingming Cao wrote: > I agree delayed allocation make much sense with multiblock allocation. > But I still think itself worth the effort, even without multiple block > allocation. On ABISS, we're currently also experimenting with delayed allocation. There, the goal is less to improve overall performance, but to move the accesses out of the synchronous code path for write(2). The code works quite nicely for FAT and ext2, limiting the time it takes to make a write call writing new data to about 4-6 ms on a fairly sluggish machine (plus about 2-4 ms for moving the playout point, which is a separate operation in ABISS), and with eight competing best-effort writers who each enjoy write latencies of some 8 seconds, worst-case, overwriting old data. Of course, this fails horribly on ext3, because it doesn't do anything useful with the journal. Another problem is error handling. Since FAT and ext2 don't have any form of reservation, a full disk isn't detected until it's far too late. So, a VFS-level reservation function would indeed be nice to have. I looked at ext3 delalloc briefly, and while it did indeed improve performance quite nicely, by being tied to ext3 internals, it would be difficult to use in the framework of ABISS, where the code paths are different (e.g. the prepare/commit functions should be as close to no-ops as possible, and leave all the work to the prefetcher thread), and which tries to be relatively file system independent. 
- Werner -- _________________________________________________________________________ / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / /_http://www.almesberger.net/____________________________________________/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents) 2005-03-14 8:36 ` Werner Almesberger @ 2005-03-14 9:04 ` Suparna Bhattacharya 2005-03-14 15:02 ` Werner Almesberger 0 siblings, 1 reply; 24+ messages in thread From: Suparna Bhattacharya @ 2005-03-14 9:04 UTC (permalink / raw) To: Werner Almesberger Cc: Mingming Cao, Badari Pulavarty, ext2-devel, linux-fsdevel, Alex Tomas, abiss-general On Mon, Mar 14, 2005 at 05:36:58AM -0300, Werner Almesberger wrote: > Mingming Cao wrote: > > I agree delayed allocation makes much sense with multiblock allocation. > > But I still think it is worth the effort by itself, even without multiple > > block allocation. > > On ABISS, we're currently also experimenting with delayed allocation. > There, the goal is less to improve overall performance, but to move > the accesses out of the synchronous code path for write(2). > > The code works quite nicely for FAT and ext2, limiting the time it > takes to make a write call writing new data to about 4-6 ms on a > fairly sluggish machine (plus about 2-4 ms for moving the playout > point, which is a separate operation in ABISS), and with eight > competing best-effort writers who each enjoy write latencies of some > 8 seconds, worst-case, overwriting old data. > > Of course, this fails horribly on ext3, because it doesn't do anything > useful with the journal. Another problem is error handling. Since FAT > and ext2 don't have any form of reservation, a full disk isn't detected > until it's far too late. > > So, a VFS-level reservation function would indeed be nice to have. > > I looked at ext3 delalloc briefly, and while it did indeed improve > performance quite nicely, by being tied to ext3 internals, it would > be difficult to use in the framework of ABISS, where the code paths > are different (e.g. 
the prepare/commit functions should be as close > to no-ops as possible, and leave all the work to the prefetcher > thread), and which tries to be relatively file system independent. I'm looking at whether we can do most of it at VFS level ... with ext3 only taking care of the additional journalling bit - seems quite feasible. There are two reqs (1) reservation (2) changing mpage_writepages to use get_blocks(), which don't seem too hard. ext3 ordered mode will need a bit more thought. Of course, I haven't looked at how ABISS does delayed alloc -- do you have a patch snippet I can look at ? Regards Suparna > > - Werner > > -- > _________________________________________________________________________ > / Werner Almesberger, Buenos Aires, Argentina werner@almesberger.net / > /_http://www.almesberger.net/____________________________________________/ -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 24+ messages in thread
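The first of the two requirements, a light-weight reservation at the VFS level, can be sketched as simple counter bookkeeping: reserve when a page is dirtied, convert the reservation into a real allocation at writeback, and hand it back if the data is truncated first. Everything below (struct, function names, the -1 standing in for -ENOSPC) is hypothetical and only meant to show why a full disk can be reported at write time rather than at writeback time.

```c
#include <assert.h>

/* Hypothetical per-filesystem counters. Invariant kept by the
 * functions below: reserved_blocks <= free_blocks. */
struct sketch_fs {
    unsigned long free_blocks;      /* not yet allocated on disk */
    unsigned long reserved_blocks;  /* promised to dirty pages */
};

/* Called when pages are dirtied. Returns 0 on success, or -1 (a
 * stand-in for -ENOSPC) if the promise cannot be kept. */
int sketch_reserve(struct sketch_fs *fs, unsigned long n)
{
    if (fs->free_blocks - fs->reserved_blocks < n)
        return -1;
    fs->reserved_blocks += n;
    return 0;
}

/* Called at writeback: the reservation becomes a real allocation. */
void sketch_allocate_reserved(struct sketch_fs *fs, unsigned long n)
{
    assert(n <= fs->reserved_blocks);
    fs->reserved_blocks -= n;
    fs->free_blocks -= n;
}

/* Called if the dirty data is truncated before writeback. */
void sketch_release(struct sketch_fs *fs, unsigned long n)
{
    assert(n <= fs->reserved_blocks);
    fs->reserved_blocks -= n;
}
```

The point of the design is that the write path only touches two counters, while the choice of actual on-disk blocks is left to writeback, where many pages can be handled in one get_blocks()-style call.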
* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14  9:04 ` Suparna Bhattacharya
@ 2005-03-14 15:02 ` Werner Almesberger
  2005-03-14 15:43 ` Alex Tomas
  0 siblings, 1 reply; 24+ messages in thread
From: Werner Almesberger @ 2005-03-14 15:02 UTC
To: Suparna Bhattacharya
Cc: Mingming Cao, Badari Pulavarty, ext2-devel, linux-fsdevel, Alex Tomas, abiss-general

Suparna Bhattacharya wrote:
> I'm looking at whether we can do most of it at VFS level

Do you plan to reserve space as "blocks, somewhere", or as "these
specific on-disk locations" ? In ABISS, we did something of the
latter kind (in order to make large contiguous allocations also on
FAT), and it turned out to be a big mess, because ABISS needed too
much support from the file system driver. So we just scrapped that
bit :-)

> Of course, I haven't looked at how ABISS does delayed alloc --
> do you have a patch snippet I can look at ?

I just made a release. The kernel patch is in
abiss-7/kernel/abiss.patch

It's all in one big patch, sorry. The main purpose of this is to
see what we can achieve, so it's not very polished.

The main parts: we added a new page flag, PG_delalloc, which
basically tells everyone to stay away from that page. There are
two purposes: (a) to make sure no allocation happens unless
explicitly requested, and (b) prevent the page from being written
back while it is still in ABISS' playout buffer. The reason for
(b) is that the page gets locked during writeback, which could
cause delays if the ABISS-using application then decides to
access the page.

The "hands off" code is mainly in fs/buffer.c, in the functions
__block_commit_write (set the page dirty, then go away),
cont_prepare_write (for FAT, do nothing), block_prepare_write
(for ext2, do nothing), and then fs/mpage.c:mpage_writepages
(skip pages marked for delayed allocation).

cont_prepare_write also needs to handle the special case where it
has to fill holes in a file. In this case, it simply overrides
delayed allocation. This bit will need more work.

Since ABISS prefetches pages, cont_prepare_write and
block_prepare_write may now see pages that are already up to date,
so they must not zero them.

The prefetching happens in fs/abiss/sched_lib.c:abiss_read_page,
and writeback in abiss_put_page. We also experimented with leaving
the writeback to MM, but that led to OOM far too often. The current
solution works quite smoothly even if we tax the system hard.

In order to keep things simple, I didn't try to make delayed
allocation do anything for writers that don't use ABISS.

The life cycle of a page is about as follows: when an application
reads or writes a file, ABISS maintains a playout buffer for it,
that typically reaches a few hundred kB ahead of the current file
position. Pages are prefetched and locked in the playout buffer.
The playout buffer is dimensioned such that when file data enters
the playout buffer, there is enough time for the data to be in
memory by the time the application reaches it.

ABISS just calls readpage to get the data, which either causes it
to be read from disk, or the page to be zeroed, if we're beyond EOF
or at a hole.

The application accesses the page through the normal VFS functions,
so in the case of writing, the prepare/commit process happens.

Once the application has accessed the page, and moves the playout
buffer beyond it, the page is released and written back to disk.

Prefetching and writeback are done in a separate kernel thread, so
the application does not get delayed.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
/_http://www.almesberger.net/____________________________________________/

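The "hands off" behaviour described in this message can be pictured with a small user-space model. This is only a sketch and not ABISS code: `struct page_sim`, the `PGS_*` flags, `writepages_sim` and `release_from_playout` are hypothetical names that stand in for the page flags, for the skip in mpage_writepages, and for a page leaving the playout buffer.

```c
#include <stddef.h>

/* Hypothetical stand-ins for page flags; not the kernel's definitions. */
enum {
    PGS_DIRTY    = 1u << 0,  /* page holds data that is not yet on disk */
    PGS_DELALLOC = 1u << 1   /* page is held back: no allocation, no writeback */
};

struct page_sim {
    unsigned flags;
    long     block;          /* on-disk block, or -1 while unallocated */
};

/*
 * Model of one writeback pass: dirty pages get a block and are cleaned,
 * but pages marked PGS_DELALLOC are skipped entirely.
 * Returns the number of pages written.
 */
static size_t writepages_sim(struct page_sim *pages, size_t n, long *next_free)
{
    size_t written = 0;

    for (size_t i = 0; i < n; i++) {
        struct page_sim *p = &pages[i];

        if (!(p->flags & PGS_DIRTY))
            continue;
        if (p->flags & PGS_DELALLOC)
            continue;                    /* hands off: still in the playout buffer */
        if (p->block < 0)
            p->block = (*next_free)++;   /* allocation happens only here */
        p->flags &= ~PGS_DIRTY;
        written++;
    }
    return written;
}

/* Model of the page leaving the playout buffer: writeback may now touch it. */
static void release_from_playout(struct page_sim *p)
{
    p->flags &= ~PGS_DELALLOC;
}
```

In this model a dirty page that is still flagged stays dirty and unallocated across a writeback pass, and is picked up by the next pass after it has been released.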
* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 15:02 ` Werner Almesberger
@ 2005-03-14 15:43 ` Alex Tomas
  2005-03-14 16:37 ` [Ext2-devel] " Werner Almesberger
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Tomas @ 2005-03-14 15:43 UTC
To: Werner Almesberger
Cc: Suparna Bhattacharya, Mingming Cao, Badari Pulavarty, ext2-devel, linux-fsdevel, Alex Tomas, abiss-general

>>>>> Werner Almesberger (WA) writes:

 WA> Do you plan to reserve space as "blocks, somewhere", or as "these
 WA> specific on-disk locations" ? In ABISS, we did something of the
 WA> latter kind (in order to make large contiguous allocations also on
 WA> FAT), and it turned out to be a big mess, because ABISS needed too
 WA> much support from the file system driver. So we just scrapped that
 WA> bit :-)

I see no reason to reserve specific blocks in ->prepare/->commit in
the delayed allocation case. We already do this with reservation.
The sole point of delayed allocation is to allocate many blocks at
once: to minimize fragmentation, to decrease allocator involvement,
and to avoid allocation altogether if the file gets truncated quickly.

 WA> The main parts: we added a new page flag, PG_delalloc, which
 WA> basically tells everyone to stay away from that page. There are
 WA> two purposes: (a) to make sure no allocation happens unless
 WA> explicitly requested, and (b) prevent the page from being written
 WA> back while it is still in ABISS' playout buffer. The reason for
 WA> (b) is that the page gets locked during writeback, which could
 WA> cause delays if the ABISS-using application then decides to
 WA> access the page.

locked during writeback? PG_writeback should be used instead of PG_locked.

thanks, Alex

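The three benefits Alex lists (one large contiguous allocation, fewer allocator calls, and no allocation for data that is truncated before writeback) can be seen in a toy model. This is a sketch under simplifying assumptions, not the ext3 delalloc code: all names are hypothetical, the allocator simply hands out blocks sequentially, and the up-front reservation step discussed elsewhere in the thread is left out.

```c
#include <stddef.h>

/* Hypothetical model of a file whose dirty blocks have no disk location yet. */
struct file_sim {
    size_t pending;      /* dirty blocks waiting for allocation */
};

/* Hypothetical allocator that hands out blocks sequentially. */
struct alloc_sim {
    long   next_free;    /* next unused on-disk block */
    size_t calls;        /* how many times the allocator was invoked */
};

struct extent_sim {
    long   start;        /* first block, or -1 if nothing was allocated */
    size_t len;          /* number of contiguous blocks */
};

/* A write only records that one more block will need space later. */
static void write_block(struct file_sim *f)
{
    f->pending++;
}

/* Truncating before writeback drops the pending blocks without allocating. */
static void truncate_file(struct file_sim *f)
{
    f->pending = 0;
}

/* Writeback: a single allocator call covers every pending block. */
static struct extent_sim flush_delalloc(struct file_sim *f, struct alloc_sim *a)
{
    struct extent_sim e = { -1, 0 };

    if (f->pending == 0)
        return e;        /* nothing to place: the allocator is not involved */

    a->calls++;
    e.start = a->next_free;
    e.len = f->pending;
    a->next_free += (long)f->pending;
    f->pending = 0;
    return e;
}
```

Eight writes followed by one flush yield a single extent of eight blocks from one allocator call, whereas allocating at write time would have invoked the allocator eight times.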
* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 15:43 ` Alex Tomas
@ 2005-03-14 16:37 ` Werner Almesberger
  2005-03-14 17:13 ` Alex Tomas
  2005-03-14 22:23 ` Bryan Henderson
  0 siblings, 2 replies; 24+ messages in thread
From: Werner Almesberger @ 2005-03-14 16:37 UTC
To: Alex Tomas
Cc: Suparna Bhattacharya, Mingming Cao, Badari Pulavarty, ext2-devel, linux-fsdevel, abiss-general

Alex Tomas wrote:
> I see no reason to reserve specific block in ->prepare/->commit in
> delayed allocation case. We already do this with reservation.

This seems like a sensible approach to me. Trying to reserve
specific blocks in an FS-independent way was what got us in trouble
on ABISS. So the plan B is to add this kind of reservation to where
it is really lacking (i.e. FAT).

Hmm, it's a bit confusing that we call both things "reservation".
Well, airlines do this too, "free seating".

> locked during writeback? PG_writeback should be used instead of PG_locked.

In mpage_writepages, writepage can also get called with the page just
PG_locked.

- Werner

* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 16:37 ` [Ext2-devel] " Werner Almesberger
@ 2005-03-14 17:13 ` Alex Tomas
  2005-03-15  0:28 ` Werner Almesberger
  2005-03-14 22:23 ` Bryan Henderson
  1 sibling, 1 reply; 24+ messages in thread
From: Alex Tomas @ 2005-03-14 17:13 UTC
To: Werner Almesberger
Cc: Alex Tomas, Suparna Bhattacharya, Mingming Cao, Badari Pulavarty, ext2-devel, linux-fsdevel, abiss-general

>>>>> Werner Almesberger (WA) writes:

 >> locked during writeback? PG_writeback should be used instead of PG_locked.

 WA> In mpage_writepages, writepage can also get called with the page just
 WA> PG_locked.

you can drop PG_locked right as you set PG_writeback, I think

thanks, Alex

* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 17:13 ` Alex Tomas
@ 2005-03-15  0:28 ` Werner Almesberger
  0 siblings, 0 replies; 24+ messages in thread
From: Werner Almesberger @ 2005-03-15 0:28 UTC
To: Alex Tomas
Cc: Suparna Bhattacharya, Mingming Cao, Badari Pulavarty, ext2-devel, linux-fsdevel, abiss-general

Alex Tomas wrote:
> you can drop PG_locked right as you set PG_writeback, I think

Hmm, not sure. mpage_writepage never calls writepage with
PG_writeback, only with PG_locked. Also, mpage_writepage calls
get_block with PG_locked, so the allocation, which may take a
while, holds the lock.

This situation is admittedly a bit annoying: on the one hand, "sync"
should write all dirty data. On the other hand, if a random user
typing "sync" can break performance guarantees, these guarantees
aren't very valuable.

- Werner

* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 16:37 ` [Ext2-devel] " Werner Almesberger
  2005-03-14 17:13 ` Alex Tomas
@ 2005-03-14 22:23 ` Bryan Henderson
  2005-03-15  0:42 ` Werner Almesberger
  1 sibling, 1 reply; 24+ messages in thread
From: Bryan Henderson @ 2005-03-14 22:23 UTC
To: Werner Almesberger
Cc: abiss-general, Alex Tomas, cmm, ext2-devel, linux-fsdevel, pbadari, Suparna Bhattacharya

>Hmm, it's a bit confusing that we call both things "reservation".

I think "reservation" is wrong for one of them and anyone using it
that way should stop. I believe the common terminology is:

  - choosing the blocks is "placement."

  - committing the required number of blocks from the resource pool
    for the instant use is "reservation."

  - the combination of reservation and placement is "allocation."

Obviously, traditional filesystem drivers haven't split placement
from reservation, so don't bother to use those terms. Most delaying
schemes delay the placement but not the reservation because they
don't want to accept the possibility that a write would fail for
lack of space after the write() system call succeeded.

Even in non-filesystem areas, "allocate" usually means to assign
particular resources, while "reserve" just means to make
arrangements so that a future allocate will succeed. For example,
if you know you need up to 10 blocks of memory to complete a task
without deadlocking, but you don't know yet exactly how many, you
would reserve 10 blocks and later, if necessary, allocate the
actual blocks.

-- 
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems

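Bryan's three terms map naturally onto three operations on a block pool. The following is a minimal sketch of that vocabulary, assuming a trivial sequential placement policy; `struct pool_sim` and the function names are hypothetical and are not taken from any filesystem.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical block pool used only to illustrate the terminology. */
struct pool_sim {
    size_t free_blocks;  /* blocks neither reserved nor placed */
    size_t reserved;     /* blocks committed, locations not yet chosen */
    long   next_block;   /* trivial placement policy: next block in sequence */
};

/* Reservation: commit n blocks from the pool without choosing which ones. */
static bool reserve_blocks(struct pool_sim *p, size_t n)
{
    if (n > p->free_blocks)
        return false;    /* lack of space is reported here, up front */
    p->free_blocks -= n;
    p->reserved += n;
    return true;
}

/*
 * Placement: choose locations for n previously reserved blocks.
 * Returns the first block, or -1 if the reservation does not cover n.
 */
static long place_blocks(struct pool_sim *p, size_t n)
{
    long first;

    if (n > p->reserved)
        return -1;
    first = p->next_block;
    p->reserved -= n;
    p->next_block += (long)n;
    return first;
}

/* Allocation: reservation and placement in one step, the traditional way. */
static long allocate_blocks(struct pool_sim *p, size_t n)
{
    if (!reserve_blocks(p, n))
        return -1;
    return place_blocks(p, n);
}
```

A delaying scheme in this model calls `reserve_blocks` at write() time, so that an out-of-space error is returned to the caller immediately, and calls `place_blocks` only at writeback.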
* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 22:23 ` Bryan Henderson
@ 2005-03-15  0:42 ` Werner Almesberger
  2005-03-15 21:59 ` Bryan Henderson
  0 siblings, 1 reply; 24+ messages in thread
From: Werner Almesberger @ 2005-03-15 0:42 UTC
To: Bryan Henderson
Cc: abiss-general, Alex Tomas, cmm, ext2-devel, linux-fsdevel, pbadari, Suparna Bhattacharya

Bryan Henderson wrote:
> I think "reservation" is wrong for one of them and anyone using it that
> way should stop.

Hehe, start with ext3 :-)

> I believe the common terminology is:

Sounds reasonable. The thing with "reservation" is that people use
it in daily life with all kinds of meanings, and often with the
object of the reservation, e.g. "reserve a seat" (typically a
specific seat), "reserve some time" (often not a specific
interval), or "reserve a table" (at a restaurant, you don't know
which one, but the restaurant staff does).

To muddy the issue further, reservations can be more or less firm.
E.g. if we "reserve" the next hundred blocks, so that allocation is
contiguous, we may want to be able to take them away if some other
file needs them. On the other hand, if storage is already
committed, but just not on disk yet, that reservation shouldn't be
revokable.

- Werner

* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-15  0:42 ` Werner Almesberger
@ 2005-03-15 21:59 ` Bryan Henderson
  0 siblings, 0 replies; 24+ messages in thread
From: Bryan Henderson @ 2005-03-15 21:59 UTC
To: Werner Almesberger
Cc: abiss-general, Alex Tomas, cmm, ext2-devel, linux-fsdevel, pbadari, Suparna Bhattacharya

>Sounds reasonable. The thing with "reservation" is that people use
>it in daily life with all kinds of meanings,

That's the way it is all over. Normal people are very sloppy in
their language. Engineers have to try to narrow the meanings of the
common words to avoid totally confusing each other in these complex
discussions.

But I think "reserve" in common usage is a lot less ambiguous than
you say. I believe when you reserve a seat on an airplane, most of
the time it isn't a particular seat. When it is, the airline will
call it a "seat assignment" and you get it only after you turn your
reservation into a purchased ticket. I've never worked in a
restaurant, but I've always assumed that when I make a reservation,
even the restaurant doesn't know which table it is until I show up.
That way, it can load balance and give people choices when they
come in.

>E.g. if we "reserve" the next hundred blocks, so that allocation is
>contiguous, we may want to be able to take them away if some other
>file needs them.

I would not call that a reservation. I did, incidentally, design
such a system once, and I called it "pencilled in." I might also
call it preliminary placement.

But I agree that reservations can be more or less firm, owing to
the fact that sometimes they can be broken, with more or less ease.
E.g. you might reserve a megabyte of space for a file, and under
pathological conditions still be told when you go to write that
there's no space for you and you're screwed. Just like you can get
to the restaurant and be told there's no table for you.

* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-04  1:12 ` [Ext2-devel] " Badari Pulavarty
  2005-03-04  1:46 ` Mingming Cao
@ 2005-03-04 11:30 ` Alex Tomas
  2005-03-04 15:02 ` Alex Tomas
  2 siblings, 0 replies; 24+ messages in thread
From: Alex Tomas @ 2005-03-04 11:30 UTC
To: Badari Pulavarty
Cc: suparna, ext2-devel, linux-fsdevel, alex

On 03 Mar 2005 17:12:14 -0800
Badari Pulavarty <pbadari@us.ibm.com> wrote:

> Just doing delayed allocation without multiblock allocation
> (with the current layout) is not really a useful thing, IMHO.
> We will benifit few cases, but in general - we moved the
> block allocation overhead from prepare write to writepages/writepage
> time. There is a little benifit of not doing journaling twice etc..
> but I don't think it would be enough to justify the effort.
> Isn't it ?

one more benefit - if the file gets truncated soon, no allocation
is needed at all

> One more thing, we need to keep in mind is - we need to make sure
> that "ordered" mode also improved - since all our testcode
> focuses on "writeback" mode and the default mode is "ordered" :(

working on that.

thanks, Alex

* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-04  1:12 ` [Ext2-devel] " Badari Pulavarty
  2005-03-04  1:46 ` Mingming Cao
  2005-03-04 11:30 ` [Ext2-devel] " Alex Tomas
@ 2005-03-04 15:02 ` Alex Tomas
  2005-03-13 14:41 ` Delayed alloc for ordered-mode Suparna Bhattacharya
  2 siblings, 1 reply; 24+ messages in thread
From: Alex Tomas @ 2005-03-04 15:02 UTC
To: Badari Pulavarty
Cc: suparna, sct, akpm, ext2-devel, linux-fsdevel

On 03 Mar 2005 17:12:14 -0800
Badari Pulavarty <pbadari@us.ibm.com> wrote:

> One more thing, we need to keep in mind is - we need to make sure
> that "ordered" mode also improved - since all our testcode
> focuses on "writeback" mode and the default mode is "ordered" :(
>

I've just cooked the patch to implement ordered mode for delayed
allocation path. please take it:

ftp://ftp.clusterfs.com/pub/people/alex/2.6.11/ext3-delalloc-ordered-2.6.11-0.1.patch

Stephen, Andrew could you review it, please?

thanks, Alex


Index: linux-2.6.11/include/linux/jbd.h
===================================================================
--- linux-2.6.11.orig/include/linux/jbd.h	2005-03-02 20:49:13.000000000 +0300
+++ linux-2.6.11/include/linux/jbd.h	2005-03-04 17:03:52.000000000 +0300
@@ -486,6 +486,12 @@
 	struct journal_head	*t_sync_datalist;
 
 	/*
+	 * Number of BIO's submited in context of the transaction we
+	 * want to complete before committing
+	 */
+	atomic_t		t_bios_in_flight;
+
+	/*
 	 * Doubly-linked circular list of all forget buffers (superseded
 	 * buffers which we can un-checkpoint once this transaction commits)
 	 * [j_list_lock]
@@ -678,6 +684,9 @@
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for all BIOs to complete */
+	wait_queue_head_t	j_wait_bios;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct semaphore	j_checkpoint_sem;
 
Index: linux-2.6.11/fs/jbd/commit.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/commit.c	2005-03-02 20:49:09.000000000 +0300
+++ linux-2.6.11/fs/jbd/commit.c	2005-03-04 17:53:52.000000000 +0300
@@ -619,6 +620,13 @@
 	if (is_journal_aborted(journal))
 		goto skip_commit;
 
+	/*
+	 * Before the commit record, we have to wait for all bio's
+	 * ext3_wb_writepages() issued against newly-allocated blocks
+	 */
+	wait_event(journal->j_wait_bios,
+		atomic_read(&commit_transaction->t_bios_in_flight) == 0);
+
 	/* Done it all: now write the commit record.  We should have
 	 * cleaned up our previous buffers by now, so if we are in abort
 	 * mode we can now just skip the rest of the journal write
Index: linux-2.6.11/fs/jbd/transaction.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/transaction.c	2005-03-02 20:49:09.000000000 +0300
+++ linux-2.6.11/fs/jbd/transaction.c	2005-03-04 17:05:28.000000000 +0300
@@ -51,6 +51,7 @@
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
+	atomic_set(&transaction->t_bios_in_flight, 0);
 
 	/* Set up the commit timer for the new transaction. */
 	journal->j_commit_timer->expires = transaction->t_expires;
Index: linux-2.6.11/fs/jbd/journal.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/journal.c	2005-03-04 17:04:29.000000000 +0300
+++ linux-2.6.11/fs/jbd/journal.c	2005-03-04 17:04:40.000000000 +0300
@@ -671,6 +671,7 @@
 	init_waitqueue_head(&journal->j_wait_checkpoint);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
+	init_waitqueue_head(&journal->j_wait_bios);
 	init_MUTEX(&journal->j_barrier);
 	init_MUTEX(&journal->j_checkpoint_sem);
 	spin_lock_init(&journal->j_revoke_lock);
Index: linux-2.6.11/fs/ext3/writeback.c
===================================================================
--- linux-2.6.11.orig/fs/ext3/writeback.c	2005-03-04 15:10:01.000000000 +0300
+++ linux-2.6.11/fs/ext3/writeback.c	2005-03-04 17:33:05.000000000 +0300
@@ -145,6 +145,17 @@
 	if (bio->bi_size)
 		return 1;
 
+	if (bio->bi_private) {
+		transaction_t *transaction = bio->bi_private;
+
+		/*
+		 * journal_commit_transaction() may be awaiting
+		 * the bio to complete.
+		 */
+		if (atomic_dec_and_test(&transaction->t_bios_in_flight))
+			wake_up(&transaction->t_journal->j_wait_bios);
+	}
+
 	do {
 		struct page *page = bvec->bv_page;
 
@@ -162,6 +173,16 @@
 static struct bio *ext3_wb_bio_submit(struct bio *bio, handle_t *handle)
 {
 	bio->bi_end_io = ext3_wb_end_io;
+	if (handle) {
+		/*
+		 * In data=ordered we shouldn't commit the transaction
+		 * until all data related to the transaction get on a
+		 * platter.
+		 */
+		atomic_inc(&handle->h_transaction->t_bios_in_flight);
+		bio->bi_private = handle->h_transaction;
+	} else
+		bio->bi_private = NULL;
 	submit_bio(WRITE, bio);
 	return NULL;
 }

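The ordering rule the patch enforces (count data bios at submission, decrement on completion, and hold back the commit record until the count is zero) can be modelled in user space with C11 atomics. This is a single-threaded sketch, not the JBD code: `struct txn_sim` and its functions are hypothetical, the `wakeups` counter stands in for wake_up(), and where the real commit path sleeps in wait_event() the model only exposes the predicate being waited on.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the transaction fields the patch adds. */
struct txn_sim {
    atomic_int bios_in_flight;  /* plays the role of t_bios_in_flight */
    int        wakeups;         /* counts what would be wake_up() calls */
};

static void txn_init(struct txn_sim *t)
{
    atomic_init(&t->bios_in_flight, 0);
    t->wakeups = 0;
}

/* Submission side: account for the bio before it is handed to the device. */
static void bio_submit(struct txn_sim *t)
{
    atomic_fetch_add(&t->bios_in_flight, 1);
}

/*
 * Completion side: atomic_fetch_sub() returns the previous value, so a
 * return of 1 means this completion brought the counter to zero, which
 * is the case in which the kernel's atomic_dec_and_test() returns true.
 */
static void bio_end_io(struct txn_sim *t)
{
    if (atomic_fetch_sub(&t->bios_in_flight, 1) == 1)
        t->wakeups++;
}

/* Commit side: the condition that must hold before the commit record. */
static bool may_write_commit_record(struct txn_sim *t)
{
    return atomic_load(&t->bios_in_flight) == 0;
}
```

With two bios submitted, the predicate stays false after the first completion and turns true after the second, which is also the only completion that triggers a wakeup.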
* Delayed alloc for ordered-mode
  2005-03-04 15:02 ` Alex Tomas
@ 2005-03-13 14:41 ` Suparna Bhattacharya
  2005-03-13 19:32 ` Badari Pulavarty
  0 siblings, 1 reply; 24+ messages in thread
From: Suparna Bhattacharya @ 2005-03-13 14:41 UTC
To: Alex Tomas
Cc: Badari Pulavarty, sct, akpm, ext2-devel, linux-fsdevel

What would be really nice is if we could do this in a way that
enables reuse of generic paths even for ordered mode. One thought
that comes to mind is journal commit waiting for writeback to
complete on the data pages which need to be flushed to disk before
meta-data can be committed, much like we do for O_SYNC.

I realise that JBD is intended to work at a level of abstraction
where it has no awareness of filesystems - hence the correspondence
with buffer heads all through. So would the above be a complete
no-no ?

Regards
Suparna

On Fri, Mar 04, 2005 at 06:02:35PM +0300, Alex Tomas wrote:
> On 03 Mar 2005 17:12:14 -0800
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> > One more thing, we need to keep in mind is - we need to make sure
> > that "ordered" mode also improved - since all our testcode
> > focuses on "writeback" mode and the default mode is "ordered" :(
> >
>
> I've just cooked the patch to implement ordered mode for delayed
> allocation path. please take it:
>
> ftp://ftp.clusterfs.com/pub/people/alex/2.6.11/ext3-delalloc-ordered-2.6.11-0.1.patch
>
> Stephen, Andrew could you review it, please?
>
> thanks, Alex

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India

* Re: Delayed alloc for ordered-mode
  2005-03-13 14:41 ` Delayed alloc for ordered-mode Suparna Bhattacharya
@ 2005-03-13 19:32 ` Badari Pulavarty
  0 siblings, 0 replies; 24+ messages in thread
From: Badari Pulavarty @ 2005-03-13 19:32 UTC
To: suparna
Cc: Alex Tomas, sct, akpm, linux-fsdevel

I think adding support to JBD to deal with "bio"s would be a
valuable generic extension. Anyway, it's dealing with "bh"s now.

Thanks,
Badari

Suparna Bhattacharya wrote:
> What would be really nice is if we could do this in a way that
> enables reuse of generic paths even for ordered mode. One thought
> that comes to mind is journal commit waiting for writeback to
> complete on the data pages which need to be flushed to disk before
> meta-data can be committed, much like we do for O_SYNC.
>
> I realise that JBD is intended to work at a level of abstraction
> where it has no awareness of filesystems - hence the correspondence
> with buffer heads all through. So would the above be a complete
> no-no ?
>
> Regards
> Suparna
>
> On Fri, Mar 04, 2005 at 06:02:35PM +0300, Alex Tomas wrote:
>
>>On 03 Mar 2005 17:12:14 -0800
>>Badari Pulavarty <pbadari@us.ibm.com> wrote:
>>
>>
>>>One more thing, we need to keep in mind is - we need to make sure
>>>that "ordered" mode also improved - since all our testcode
>>>focuses on "writeback" mode and the default mode is "ordered" :(
>>>
>>
>>I've just cooked the patch to implement ordered mode for delayed
>>allocation path. please take it:
>>
>>ftp://ftp.clusterfs.com/pub/people/alex/2.6.11/ext3-delalloc-ordered-2.6.11-0.1.patch
>>
>>Stephen, Andrew could you review it, please?
>>
>>thanks, Alex

end of thread, other threads:[~2005-03-15 21:59 UTC | newest]

Thread overview: 24+ messages
2005-03-03  8:33 Reviewing ext3 improvement patches (delalloc, mballoc, extents) Suparna Bhattacharya
2005-03-03  9:40 ` Andreas Dilger
2005-03-03 22:10 ` Theodore Ts'o
2005-03-03 22:30 ` Alex Tomas
2005-03-04 11:13 ` Suparna Bhattacharya
2005-03-04 12:29 ` Alex Tomas
2005-03-04 18:25 ` [Ext2-devel] " Andreas Dilger
2005-03-04  1:12 ` [Ext2-devel] " Badari Pulavarty
2005-03-04  1:46 ` Mingming Cao
2005-03-04  3:26 ` Suparna Bhattacharya
2005-03-14  8:36 ` Werner Almesberger
2005-03-14  9:04 ` Suparna Bhattacharya
2005-03-14 15:02 ` Werner Almesberger
2005-03-14 15:43 ` Alex Tomas
2005-03-14 16:37 ` [Ext2-devel] " Werner Almesberger
2005-03-14 17:13 ` Alex Tomas
2005-03-15  0:28 ` Werner Almesberger
2005-03-14 22:23 ` Bryan Henderson
2005-03-15  0:42 ` Werner Almesberger
2005-03-15 21:59 ` Bryan Henderson
2005-03-04 11:30 ` [Ext2-devel] " Alex Tomas
2005-03-04 15:02 ` Alex Tomas
2005-03-13 14:41 ` Delayed alloc for ordered-mode Suparna Bhattacharya
2005-03-13 19:32 ` Badari Pulavarty