Reviewing ext3 improvement patches (delalloc, mballoc, extents)

linux-fsdevel.vger.kernel.org archive mirror

* Reviewing ext3 improvement patches (delalloc, mballoc, extents)
@ 2005-03-03  8:33 Suparna Bhattacharya
  2005-03-03  9:40 ` Andreas Dilger
  2005-03-04  1:12 ` [Ext2-devel] " Badari Pulavarty
  0 siblings, 2 replies; 24+ messages in thread
From: Suparna Bhattacharya @ 2005-03-03  8:33 UTC (permalink / raw)
  To: ext2-devel, linux-fsdevel; +Cc: Alex Tomas

Since the performance improvements seen so far are quite encouraging, 
and momentum is picking up so well, I started looking through the
patches from Alex ... just a quick code walkthrough to get a hang
of it and think about what kind of simplifications might be possible
and what it might take for inclusion.

I haven't had a chance to go too deep line by line yet,
but thought I'd initiate some discussion with some first impressions
and summary of what directions I hear several people converging
towards to validate if I'm on the right track here.

diffstat of the 3 patches : 22 files changed, 5920 insertions(+), 
47 deletions. The largest is in the extents patch (2743), mballoc 
is 1968, and delalloc is 1209. To use delalloc, which gives us
all the performance benefits, right now we need all the 3 patches
to be used in conjunction. Supporting extent map btrees as well 
as traditional indexing and associated options for compatibility etc
is perhaps the more invasive of changes. Given that keeping ext3 
stable and maintainable is a key concern (that is after all a major 
reason why a lot of users rely on ext3), a somewhat incremental 
approach is desirable. 

So, I'll start from the direction that has been suggested by
some -- (1) delayed allocation without changing the
on-disk format. And then later (2) go on to breaking format with 
all changes for scalability to larger files with full extents 
support (haven't thought enough about this yet - maybe in a
separate mail)

A few random things that come to mind for (1), going through the code:

- There might be possibilities for code reduction, by extending
  generic routines as far as possible, e.g. ext3_wb_writepages
  has a lot in common with generic writepages. That would
  also make it easier to maintain.
- Similarly, how about (as Mingming I think already hinted) 
  implementing ext3_get_blocks to do multi-block lookup and 
  allocation and using it in delalloc ?

Hmm, maybe I speak too soon - have to look at the interfaces more
closely and verify if this is feasible. 
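
A toy illustration of what a multi-block lookup buys (Python, not kernel code). The function name `get_blocks`, its arguments, and the dictionary-based block map are assumptions made for this sketch; they are not the interface used in Alex's patches. One call returns a whole physically contiguous run instead of a single block:

```python
def get_blocks(block_map, logical, max_blocks):
    """Return (first_physical, count) for the contiguous run starting at `logical`.

    block_map maps logical block numbers to physical block numbers.
    Returns (None, 0) if `logical` is a hole.
    """
    first = block_map.get(logical)
    if first is None:
        return None, 0
    count = 1
    # Extend the run while the next logical block is also the next physical block.
    while count < max_blocks and block_map.get(logical + count) == first + count:
        count += 1
    return first, count


# Logical blocks 0-3 are contiguous on disk, block 4 lives elsewhere.
block_map = {0: 100, 1: 101, 2: 102, 3: 103, 4: 500}
print(get_blocks(block_map, 0, 8))   # (100, 4)
print(get_blocks(block_map, 4, 8))   # (500, 1)
print(get_blocks(block_map, 9, 8))   # (None, 0)
```

A caller that previously needed four single-block lookups for blocks 0-3 needs one call here, which is the property a delalloc writeback path would want to exploit.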

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India

-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click


* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-03  8:33 Reviewing ext3 improvement patches (delalloc, mballoc, extents) Suparna Bhattacharya
@ 2005-03-03  9:40 ` Andreas Dilger
  2005-03-03 22:10   ` Theodore Ts'o
  2005-03-04 11:13   ` Suparna Bhattacharya
  2005-03-04  1:12 ` [Ext2-devel] " Badari Pulavarty
  1 sibling, 2 replies; 24+ messages in thread
From: Andreas Dilger @ 2005-03-03  9:40 UTC (permalink / raw)
  To: Suparna Bhattacharya; +Cc: ext2-devel, linux-fsdevel, Alex Tomas


On Mar 03, 2005  14:03 +0530, Suparna Bhattacharya wrote:
> diffstat of the 3 patches : 22 files changed, 5920 insertions(+), 
> 47 deletions. The largest is in the extents patch (2743), mballoc 
> is 1968, and delalloc is 1209. To use delalloc, which gives us
> all the performance benefits, right now we need all the 3 patches
> to be used in conjunction. Supporting extent map btrees as well 
> as traditional indexing and associated options for compatibility etc
> is perhaps the more invasive of changes. Given that keeping ext3 
> stable and maintainable is a key concern (that is after all a major 
> reason why a lot of users rely on ext3), a somewhat incremental 
> approach is desirable. 
> 
> So, I'll start from the direction that has been suggested by
> some -- (1) delayed allocation without changing the
> on-disk format. And then later (2) go on to breaking format with 
> all changes for scalability to larger files with full extents 
> support (haven't thought enough about this yet - maybe in a
> separate mail)

Well, for a starter, the extents format changes are not forced on
users, only if they mount with "-o extents" and write files will
it mark the superblock incompatible and start allocating files
this way.  I believe (though I have never tested) that even if
extents are enabled, writes to a block-mapped file will continue
to work and that file will not be converted to an extent file.

> A few random things that come to mind for (1), going through the code:
> 
> - There might be possibilities for code reduction, by extending
>   generic routines as far as possible, e.g. ext3_wb_writepages
>   has a lot in common with generic writepages. That would
>   also make it easier to maintain.

I'm sure some support for this could be gotten from e.g. XFS as well,
since their filesystem (on Irix at least) was all about delayed alloc
(not sure what it does under Linux), and I believe ReiserFS/Reiser4
also desire the ability to have delayed allocation from the VFS, i.e.
some sort of light-weight "reserve space" call for each page dirtied
and then getting the actual file + offsets en masse later (if the
VFS/VM doesn't discard the whole thing).
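
A minimal sketch of such a light-weight reservation scheme, as a toy model in Python rather than kernel code. The class and method names (`SpaceAccounting`, `reserve`, `allocate_reserved`, `release`) are hypothetical and do not correspond to any existing VFS call; the point is only that dirtying a page costs a counter update, while real allocation happens later in bulk:

```python
class SpaceAccounting:
    """Toy free-space accounting with a cheap per-page reservation step."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks   # blocks not yet allocated on disk
        self.reserved = 0                # blocks promised to dirty pages

    def reserve(self, nblocks=1):
        """Called when a page is dirtied: no allocation, just a counter check."""
        if self.reserved + nblocks > self.free_blocks:
            return False                 # would surface as ENOSPC at write time
        self.reserved += nblocks
        return True

    def allocate_reserved(self, nblocks):
        """Called at writeback: turn reservations into real allocations en masse."""
        assert nblocks <= self.reserved
        self.reserved -= nblocks
        self.free_blocks -= nblocks

    def release(self, nblocks):
        """Called if dirty pages are discarded before they are written back."""
        assert nblocks <= self.reserved
        self.reserved -= nblocks


acct = SpaceAccounting(free_blocks=3)
results = [acct.reserve() for _ in range(4)]
print(results)                           # [True, True, True, False]
acct.release(1)                          # one dirty page was thrown away
acct.allocate_reserved(2)                # the other two are written back
print(acct.free_blocks, acct.reserved)   # 1 0
```

The fourth write fails immediately instead of at writeback time, which is the property that makes delayed allocation safe on a nearly full disk.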

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/             http://members.shaw.ca/golinux/



* Re: Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-03  9:40 ` Andreas Dilger
@ 2005-03-03 22:10   ` Theodore Ts'o
  2005-03-03 22:30     ` Alex Tomas
  2005-03-04 11:13   ` Suparna Bhattacharya
  1 sibling, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2005-03-03 22:10 UTC (permalink / raw)
  To: Suparna Bhattacharya, ext2-devel, linux-fsdevel, Alex Tomas

On Thu, Mar 03, 2005 at 02:40:21AM -0700, Andreas Dilger wrote:
> 
> Well, for a starter, the extents format changes are not forced on
> users, only if they mount with "-o extents" and write files will
> it mark the superblock incompatible and start allocating files
> this way.  I believe (though I have never tested) that even if
> extents are enabled, writes to a block-mapped file will continue
> to work and that file will not be converted to an extent file.
> 

I was about to start a new thread discussing this, but you started
commenting about it here, so I'll address it here.

The way most of the other ext3 extensions that involve feature changes
work is that you enable them by using tune2fs (for example, "tune2fs
-O dir_index /dev/hdaXX"), instead of using a mount option which then
causes the kernel to automatically flip on an incompatible-feature
flag.  It would be good if the extents were changed to follow this
convention for the following reasons:

	1) Consistency is less confusing for users, and features like
	   dir_index and has_journal already work this way.

	2) Long term, you don't want users to have to specify -o
	   extents as a mount option, or specify "extents" in
	   /etc/fstab in order to make the filesystem work correctly.
	   From an initial examination, it looks like the extents code
	   won't be initialized properly if you don't specify -o
	   extents.  So I'm pretty sure that if you try to mount a
	   filesystem that has extents, but forget to specify -o
	   extents, things will break in an entertaining fashion.  I
	   haven't tried it yet, though.

	3) It means that users who are fooling around with the patch
	   will have at least some kind of patched userspace first (or
	   understand how to use debugfs to set the feature manually).
	   This makes it less likely that they will apply the patch,
	   or get it applied for free when they use the -mm tree (once
	   the patches get accepted by Andrew), and then when they
	   specify -o extents, all of a sudden the filesystem becomes
	   incompatible and is no longer mountable using standard
	   tools.

This is a bit of a pet peeve of mine, since I am still getting
personal e-mail from people who were using Red Hat 7, and decided to
just "try out" Fedora Core 3, only to find that any filesystems
mounted by FC3 could no longer be mounted on their RH7 or RH8
systems.  That's why I prefer users to have to do something that
obviously acknowledges that they are making a change to their
filesystem's compatibility prospects --- and that's something which is
a lot more obvious with a "tune2fs -O " command, as compared to using
a magic mount option. 
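
The compatibility concern can be made concrete with a toy model of the incompat feature check (Python, not kernel code). The bit values and the helper `can_mount` are illustrative assumptions and should not be read as the on-disk ABI; the rule being modelled is that a kernel refuses to mount a filesystem whose superblock carries an incompat bit it does not understand:

```python
# Illustrative bit values; treat them as assumptions, not as the on-disk ABI.
INCOMPAT_FILETYPE = 0x0002
INCOMPAT_RECOVER = 0x0004
INCOMPAT_EXTENTS = 0x0040


def can_mount(sb_incompat, supported):
    """Refuse the mount if the superblock has any incompat bit we do not know."""
    return (sb_incompat & ~supported) == 0


old_kernel = INCOMPAT_FILETYPE | INCOMPAT_RECOVER
new_kernel = old_kernel | INCOMPAT_EXTENTS

sb = INCOMPAT_FILETYPE
print(can_mount(sb, old_kernel))   # True

# Explicit step, analogous to "tune2fs -O <feature>": the admin sets the flag.
sb |= INCOMPAT_EXTENTS
print(can_mount(sb, new_kernel))   # True
print(can_mount(sb, old_kernel))   # False
```

Whether the flag is set by a tool or silently by a mount option, the old kernel ends up locked out; the difference is only whether the user knowingly asked for it.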

						- Ted



* Re: Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-03 22:10   ` Theodore Ts'o
@ 2005-03-03 22:30     ` Alex Tomas
  0 siblings, 0 replies; 24+ messages in thread
From: Alex Tomas @ 2005-03-03 22:30 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: suparna, ext2-devel, linux-fsdevel, alex

On Thu, 3 Mar 2005 17:10:10 -0500
"Theodore Ts'o" <tytso@mit.edu> wrote:
> 
> This is a bit of a pet peeve of mine, since I am still getting
> personal e-mail from people who were using Red Hat 7, and decided to
> just "try out" Fedora Core 3, only to find that any filesystems
> mounted by FC3 could no longer be mounted on their RH7 or RH8
> systems.  That's why I prefer users to have to do something that
> obviously acknowledges that they are making a change to their
> filesystem's compatibility prospects --- and that's something which is
> a lot more obvious with a "tune2fs -O " command, as compared to using
> a magic mount option. 
> 

Makes sense to me. I'd just like to note that the mount option was chosen
for one reason only: it's simple to use during debugging.

thanks, Alex



* Re: Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-03  9:40 ` Andreas Dilger
  2005-03-03 22:10   ` Theodore Ts'o
@ 2005-03-04 11:13   ` Suparna Bhattacharya
  2005-03-04 12:29     ` Alex Tomas
  1 sibling, 1 reply; 24+ messages in thread
From: Suparna Bhattacharya @ 2005-03-04 11:13 UTC (permalink / raw)
  To: ext2-devel, linux-fsdevel, Alex Tomas

On Thu, Mar 03, 2005 at 02:40:21AM -0700, Andreas Dilger wrote:
> On Mar 03, 2005  14:03 +0530, Suparna Bhattacharya wrote:
> > diffstat of the 3 patches : 22 files changed, 5920 insertions(+), 
> > 47 deletions. The largest is in the extents patch (2743), mballoc 
> > is 1968, and delalloc is 1209. To use delalloc, which gives us
> > all the performance benefits, right now we need all the 3 patches
> > to be used in conjunction. Supporting extent map btrees as well 
> > as traditional indexing and associated options for compatibility etc
> > is perhaps the more invasive of changes. Given that keeping ext3 
> > stable and maintainable is a key concern (that is after all a major 
> > reason why a lot of users rely on ext3), a somewhat incremental 
> > approach is desirable. 
> > 
> > So, I'll start from the direction that has been suggested by
> > some -- (1) delayed allocation without changing the
> > on-disk format. And then later (2) go on to breaking format with 
> > all changes for scalability to larger files with full extents 
> > support (haven't thought enough about this yet - maybe in a
> > separate mail)
> 
> Well, for a starter, the extents format changes are not forced on
> users, only if they mount with "-o extents" and write files will
> it mark the superblock incompatible and start allocating files
> this way.  I believe (though I have never tested) that even if
> extents are enabled, writes to a block-mapped file will continue
> to work and that file will not be converted to an extent file.

Files that are created with extents will not be viewable by an older
kernel, though (I think) - which is where the format breakage comes
in (is that correct ?). But I don't see this as a major issue, since 
it can perhaps be taken care of through a little bit of migration 
tooling as Ted indicated. 

So, compatibility in itself wasn't the main concern bothering me 
but how we could make it easier to assure stability & maintainability
even with all the cool stuff. For example, if we have both mballoc 
and regular balloc and similarly extents and regular indexing based 
on growth patterns (a nice idea, btw), does it multiply the 
scenarios to verify on the testing front ? Or in dealing with changes
in the future ? I'm guessing that this might be one of the things (besides
agreement on the disk layout) holding up inclusion of extents, despite
the patches being around for a while now .. but then I could be wrong.
B-tree based extent maps were mentioned by sct way back in his 2000 
paper ! And of course every filesystem out there implements B-trees in
its own way.

I can see arguments flying both ways ... at what point do we decide
to break towards an ext4 ? 

BTW, has anyone tried playing with the idea of ext4 as not a 
cp -r fs/ext3 fs/ext4 and edit, but if possible using some layered
filesystem techniques to reuse much of ext3 directly, and just override
a few operations (like get_blocks for extents etc) where there 
is a layout impact ? 

Alex, have you had a chance to prototype your idea of rooting extents
in ea ?

> 
> > A few random things that come to mind for (1), going through the code:
> > 
> > - There might be possibilities for code reduction, by extending
> >   generic routines as far as possible, e.g. ext3_wb_writepages
> >   has a lot in common with generic writepages. That would
> >   also make it easier to maintain.
> 
> I'm sure some support for this could be gotten from e.g. XFS as well,
> since their filesystem (on Irix at least) was all about delayed alloc
> (not sure what it does under Linux), and I believe ReiserFS/Reiser4
> also desire the ability to have delayed allocation from the VFS, i.e.
> some sort of light-weight "reserve space" call for each page dirtied
> and then getting the actual file + offsets en masse later (if the
> VFS/VM doesn't discard the whole thing).

*nod*

Regards
Suparna

> 
> Cheers, Andreas
> --
> Andreas Dilger
> http://sourceforge.net/projects/ext2resize/
> http://members.shaw.ca/adilger/             http://members.shaw.ca/golinux/
> 

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


* Re: Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-04 11:13   ` Suparna Bhattacharya
@ 2005-03-04 12:29     ` Alex Tomas
  2005-03-04 18:25       ` [Ext2-devel] " Andreas Dilger
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Tomas @ 2005-03-04 12:29 UTC (permalink / raw)
  To: suparna; +Cc: ext2-devel, linux-fsdevel, alex

On Fri, 4 Mar 2005 16:43:31 +0530
Suparna Bhattacharya <suparna@in.ibm.com> wrote:

> Alex, have you had a chance to prototype your idea of rooting extents
> in ea ?

I think all you need for this is:

1) allocate EA in ext3_new_inode()
2) write a replacement for ext3_init_tree_desc()
   just a few lines of code
3) write .get_write_access and .mark_buffer_dirty methods
   again, a few lines
4) use the replacement of ext3_init_tree_desc() in a few places

thanks, Alex



* Re: [Ext2-devel] Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-04 12:29     ` Alex Tomas
@ 2005-03-04 18:25       ` Andreas Dilger
  0 siblings, 0 replies; 24+ messages in thread
From: Andreas Dilger @ 2005-03-04 18:25 UTC (permalink / raw)
  To: Alex Tomas; +Cc: suparna, ext2-devel, linux-fsdevel


On Mar 04, 2005  15:29 +0300, Alex Tomas wrote:
> On Fri, 4 Mar 2005 16:43:31 +0530
> Suparna Bhattacharya <suparna@in.ibm.com> wrote:
> 
> > Alex, have you had a chance to prototype your idea of rooting extents
> > in ea ?
> 
> I think all you need for this is:
> 
> 1) allocate EA in ext3_new_inode()
> 2) write a replacement for ext3_init_tree_desc()
>    just a few lines of code
> 3) write .get_write_access and .mark_buffer_dirty methods
>    again, a few lines
> 4) use the replacement of ext3_init_tree_desc() in a few places

This should of course only be done for large inodes.  Also, at some
point it will consume all of the EA space and we need to use an
external block.  It might help in some middle cases (i.e. files with
more extents than can fit in i_block (60 bytes), but fewer than fit
into the large inode space (128 or maybe 384 bytes)) but it might
also hurt other things if we need to allocate an EA block for another
EA...
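
To put rough numbers on those "middle cases", here is a back-of-the-envelope calculation in Python. It assumes a 12-byte extent tree header and 12-byte extent entries; both sizes are assumptions for this sketch rather than figures stated in this thread, and EA header overhead is ignored:

```python
HEADER_BYTES = 12   # assumed size of the extent tree header
EXTENT_BYTES = 12   # assumed size of one extent entry


def extents_that_fit(space_bytes):
    """Number of extent entries that fit after the header in `space_bytes`."""
    return max(0, (space_bytes - HEADER_BYTES) // EXTENT_BYTES)


for space in (60, 128, 384):
    print(space, extents_that_fit(space))
# 60 4
# 128 9
# 384 31
```

Under these assumptions, rooting the tree in the EA area would cover files with roughly 5 to 9 (or 5 to 31) extents without an external block.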

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/             http://members.shaw.ca/golinux/



* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-03  8:33 Reviewing ext3 improvement patches (delalloc, mballoc, extents) Suparna Bhattacharya
  2005-03-03  9:40 ` Andreas Dilger
@ 2005-03-04  1:12 ` Badari Pulavarty
  2005-03-04  1:46   ` Mingming Cao
                     ` (2 more replies)
  1 sibling, 3 replies; 24+ messages in thread
From: Badari Pulavarty @ 2005-03-04  1:12 UTC (permalink / raw)
  To: suparna; +Cc: ext2-devel, linux-fsdevel, Alex Tomas

On Thu, 2005-03-03 at 00:33, Suparna Bhattacharya wrote:
> Since the performance improvements seen so far are quite encouraging, 
> and momentum is picking up so well, I started looking through the
> patches from Alex ... just a quick code walkthrough to get a hang
> of it and think about what kind of simplifications might be possible
> and what it might take for inclusion.
> 
> I haven't had a chance to go too deep line by line yet,
> but thought I'd initiate some discussion with some first impressions
> and summary of what directions I hear several people converging
> towards to validate if I'm on the right track here.
> 
> diffstat of the 3 patches : 22 files changed, 5920 insertions(+), 
> 47 deletions. The largest is in the extents patch (2743), mballoc 
> is 1968, and delalloc is 1209. To use delalloc, which gives us
> all the performance benefits, right now we need all the 3 patches
> to be used in conjunction. Supporting extent map btrees as well 
> as traditional indexing and associated options for compatibility etc
> is perhaps the more invasive of changes. Given that keeping ext3 
> stable and maintainable is a key concern (that is after all a major 
> reason why a lot of users rely on ext3), a somewhat incremental 
> approach is desirable. 
> 
> So, I'll start from the direction that has been suggested by
> some -- (1) delayed allocation without changing the
> on-disk format. And then later (2) go on to breaking format with 
> all changes for scalability to larger files with full extents 
> support (haven't thought enough about this yet - maybe in a
> separate mail)
> 

Just doing delayed allocation without multiblock allocation
(with the current layout) is not really a useful thing, IMHO.
It will benefit a few cases, but in general we have just moved the
block allocation overhead from prepare_write to writepages/writepage
time. There is a little benefit from not doing journaling twice etc.,
but I don't think it would be enough to justify the effort.
Would it?

So, maybe we should look at adding multiblock allocation +
delayed allocation to the current ext3 layout. Then we can evaluate
the benefits of having "extents" etc. and then break the layout ?
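
A toy count of allocator calls shows why the two features go together (Python, not kernel code; the helper names `runs` and `allocator_calls` are made up for this sketch). Delayed allocation alone still needs one call per block at writeback; with a multiblock allocator it needs one call per contiguous dirty range:

```python
def runs(blocks):
    """Group logical block numbers into (start, length) contiguous runs."""
    out = []
    for b in sorted(set(blocks)):
        if out and out[-1][0] + out[-1][1] == b:
            out[-1] = (out[-1][0], out[-1][1] + 1)
        else:
            out.append((b, 1))
    return out


def allocator_calls(dirty_blocks, multiblock):
    """Number of allocator calls needed at writeback time."""
    return len(runs(dirty_blocks)) if multiblock else len(set(dirty_blocks))


dirty = list(range(0, 64)) + list(range(128, 160))   # two contiguous dirty ranges
print(allocator_calls(dirty, multiblock=False))      # 96
print(allocator_calls(dirty, multiblock=True))       # 2
```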

One more thing we need to keep in mind: we need to make sure
that "ordered" mode also improves, since all our test code
focuses on "writeback" mode and the default mode is "ordered" :(


Thanks,
Badari




* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-04  1:12 ` [Ext2-devel] " Badari Pulavarty
@ 2005-03-04  1:46   ` Mingming Cao
  2005-03-04  3:26     ` Suparna Bhattacharya
  2005-03-14  8:36     ` Werner Almesberger
  2005-03-04 11:30   ` [Ext2-devel] " Alex Tomas
  2005-03-04 15:02   ` Alex Tomas
  2 siblings, 2 replies; 24+ messages in thread
From: Mingming Cao @ 2005-03-04  1:46 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: suparna, ext2-devel, linux-fsdevel, Alex Tomas

On Thu, 2005-03-03 at 17:12 -0800, Badari Pulavarty wrote:
> Just doing delayed allocation without multiblock allocation
> (with the current layout) is not really a useful thing, IMHO.
> It will benefit a few cases, but in general we have just moved the
> block allocation overhead from prepare_write to writepages/writepage
> time. There is a little benefit from not doing journaling twice etc.,
> but I don't think it would be enough to justify the effort.
> Would it?
> 

Hi Badari

I agree delayed allocation makes a lot of sense with multiblock allocation.
But I still think it is worth the effort by itself, even without multiple
block allocation. If we have a seeky random-write application, and the
application later tries to fill those holes, we will normally end up with
a pretty ugly file layout. With delayed allocation, we have a better chance
of getting contiguous blocks on disk for that file.

I happened to find that Ted has mentioned this before:
http://marc.theaimsgroup.com/?l=ext2-devel&m=107239591117758&w=2
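
The hole-filling case can be simulated with a deliberately naive model (Python, not kernel code). It assumes an allocator that simply hands out the next free physical block and ignores ext3's goal-block heuristics and reservation windows, so the real difference will be smaller than shown here; all function names are invented for the sketch:

```python
def count_fragments(layout):
    """layout maps logical block -> physical block; count contiguous pieces."""
    fragments = 0
    prev = None
    for logical in sorted(layout):
        phys = layout[logical]
        if prev is None or phys != prev + 1:
            fragments += 1
        prev = phys
    return fragments


def allocate_at_write_time(write_order, first_free=1000):
    """Each block gets the next free physical block when it is written."""
    return {logical: first_free + i for i, logical in enumerate(write_order)}


def allocate_at_flush_time(write_order, first_free=1000):
    """Blocks are only allocated at writeback, in logical order."""
    ordered = sorted(set(write_order))
    return {logical: first_free + i for i, logical in enumerate(ordered)}


# Seeky writer: touches every other block first, then fills the holes.
write_order = [0, 2, 4, 6, 1, 3, 5, 7]
print(count_fragments(allocate_at_write_time(write_order)))   # 8
print(count_fragments(allocate_at_flush_time(write_order)))   # 1
```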

> So, maybe we should look at adding multiblock allocation +
> delayed allocation to the current ext3 layout. Then we can evaluate
> the benefits of having "extents" etc. and then break the layout ?
> 

The current reservation code could be improved to return how big the
free chunk inside the window is, and we could use that to help make
ext3_new_blocks()/ext3_get_blocks() happen.
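
A sketch of what "return how big the free chunk is" could look like, as a toy bitmap scan in Python. The function name `free_chunk_in_window` and the list-of-booleans bitmap are assumptions for illustration and not the actual reservation code:

```python
def free_chunk_in_window(bitmap, win_start, win_end):
    """Return (start, length) of the first free run in [win_start, win_end].

    bitmap[i] is True when block i is in use.
    Returns (None, 0) if the window has no free block.
    """
    start = None
    for blk in range(win_start, win_end + 1):
        if not bitmap[blk]:
            if start is None:
                start = blk              # first free block of the run
        elif start is not None:
            return start, blk - start    # run ended at an in-use block
    if start is None:
        return None, 0
    return start, win_end + 1 - start    # run extends to the end of the window


#           0     1     2      3      4      5     6      7
bitmap = [True, True, False, False, False, True, False, False]
print(free_chunk_in_window(bitmap, 0, 7))   # (2, 3)
print(free_chunk_in_window(bitmap, 5, 7))   # (6, 2)
print(free_chunk_in_window(bitmap, 0, 1))   # (None, 0)
```

With the length in hand, a multi-block allocation could claim the whole run in one step instead of asking for one block at a time.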



* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-04  1:46   ` Mingming Cao
@ 2005-03-04  3:26     ` Suparna Bhattacharya
  2005-03-14  8:36     ` Werner Almesberger
  1 sibling, 0 replies; 24+ messages in thread
From: Suparna Bhattacharya @ 2005-03-04  3:26 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Badari Pulavarty, ext2-devel, linux-fsdevel, Alex Tomas, akpm

On Thu, Mar 03, 2005 at 05:46:13PM -0800, Mingming Cao wrote:
> On Thu, 2005-03-03 at 17:12 -0800, Badari Pulavarty wrote:
> > Just doing delayed allocation without multiblock allocation
> > (with the current layout) is not really a useful thing, IMHO.
> > It will benefit a few cases, but in general we have just moved the
> > block allocation overhead from prepare_write to writepages/writepage
> > time. There is a little benefit from not doing journaling twice etc.,
> > but I don't think it would be enough to justify the effort.
> > Would it?
> > 
> 
> Hi Badari
> 
> I agree delayed allocation makes a lot of sense with multiblock allocation.
> But I still think it is worth the effort by itself, even without multiple
> block allocation. If we have a seeky random-write application, and the
> application later tries to fill those holes, we will normally end up with
> a pretty ugly file layout. With delayed allocation, we have a better chance
> of getting contiguous blocks on disk for that file.
> 
> I happened to find that Ted has mentioned this before:
> http://marc.theaimsgroup.com/?l=ext2-devel&m=107239591117758&w=2
> 
> > So, maybe we should look at adding multiblock allocation +
> > delayed allocation to the current ext3 layout. Then we can evaluate
> > the benefits of having "extents" etc. and then break the layout ?
> > 
> 
> The current reservation code could be improved to return how big the
> free chunk inside the window is, and we could use that to help make
> ext3_new_blocks()/ext3_get_blocks() happen.

Yup this is exactly what I was thinking.

It'll probably only be a step along the way ... but I am hoping that
this will give us a direction to merge these pieces in incrementally, 
a little at a time, each piece being very well-understood and with 
demonstrated performance improvements at every step. For example, 
the next step after the following could be to plug parts of mballoc
in to the above, etc ... 

Does that make sense ?

Regards
Suparna


-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India



* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-04  1:46   ` Mingming Cao
  2005-03-04  3:26     ` Suparna Bhattacharya
@ 2005-03-14  8:36     ` Werner Almesberger
  2005-03-14  9:04       ` Suparna Bhattacharya
  1 sibling, 1 reply; 24+ messages in thread
From: Werner Almesberger @ 2005-03-14  8:36 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Badari Pulavarty, suparna, ext2-devel, linux-fsdevel, Alex Tomas,
	abiss-general

Mingming Cao wrote:
> I agree delayed allocation makes a lot of sense with multiblock allocation.
> But I still think it is worth the effort by itself, even without multiple
> block allocation.

On ABISS, we're currently also experimenting with delayed allocation.
There, the goal is less to improve overall performance than to move
the accesses out of the synchronous code path for write(2).

The code works quite nicely for FAT and ext2, limiting the time it
takes to make a write call writing new data to about 4-6 ms on a
fairly sluggish machine (plus about 2-4 ms for moving the playout
point, which is a separate operation in ABISS), and with eight
competing best-effort writers who each enjoy write latencies of some
8 seconds, worst-case, overwriting old data.

Of course, this fails horribly on ext3, because it doesn't do anything
useful with the journal. Another problem is error handling. Since FAT
and ext2 don't have any form of reservation, a full disk isn't detected
until it's far too late.

So, a VFS-level reservation function would indeed be nice to have.

I looked at ext3 delalloc briefly, and while it did indeed improve
performance quite nicely, by being tied to ext3 internals, it would
be difficult to use in the framework of ABISS, where the code paths
are different (e.g. the prepare/commit functions should be as close
to no-ops as possible, and leave all the work to the prefetcher
thread), and which tries to be relatively file system independent.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
/_http://www.almesberger.net/____________________________________________/


* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14  8:36     ` Werner Almesberger
@ 2005-03-14  9:04       ` Suparna Bhattacharya
  2005-03-14 15:02         ` Werner Almesberger
  0 siblings, 1 reply; 24+ messages in thread
From: Suparna Bhattacharya @ 2005-03-14  9:04 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Mingming Cao, Badari Pulavarty, ext2-devel, linux-fsdevel,
	Alex Tomas, abiss-general

On Mon, Mar 14, 2005 at 05:36:58AM -0300, Werner Almesberger wrote:
> Mingming Cao wrote:
> > I agree delayed allocation makes a lot of sense with multiblock allocation.
> > But I still think it is worth the effort by itself, even without multiple
> > block allocation.
> 
> On ABISS, we're currently also experimenting with delayed allocation.
> There, the goal is less to improve overall performance than to move
> the accesses out of the synchronous code path for write(2).
> 
> The code works quite nicely for FAT and ext2, limiting the time it
> takes to make a write call writing new data to about 4-6 ms on a
> fairly sluggish machine (plus about 2-4 ms for moving the playout
> point, which is a separate operation in ABISS), and with eight
> competing best-effort writers who each enjoy write latencies of some
> 8 seconds, worst-case, overwriting old data.
> 
> Of course, this fails horribly on ext3, because it doesn't do anything
> useful with the journal. Another problem is error handling. Since FAT
> and ext2 don't have any form of reservation, a full disk isn't detected
> until it's far too late.
> 
> So, a VFS-level reservation function would indeed be nice to have.
> 
> I looked at ext3 delalloc briefly, and while it did indeed improve
> performance quite nicely, by being tied to ext3 internals, it would
> be difficult to use in the framework of ABISS, where the code paths
> are different (e.g. the prepare/commit functions should be as close
> to no-ops as possible, and leave all the work to the prefetcher
> thread), and which tries to be relatively file system independent.

I'm looking at whether we can do most of it at VFS level ... with
ext3 only taking care of the additional journalling bit - seems
quite feasible. There are two requirements: (1) reservation and
(2) changing mpage_writepages to use get_blocks(), neither of which
seems too hard. ext3 ordered mode will need a bit more thought.

Of course, I haven't looked at how ABISS does delayed alloc -- 
do you have a patch snippet I can look at ?

Regards
Suparna

> 
> - Werner
> 
> -- 
>   _________________________________________________________________________
>  / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
> /_http://www.almesberger.net/____________________________________________/
> 
> 
> _______________________________________________
> Ext2-devel mailing list
> Ext2-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ext2-devel

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India



* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14  9:04       ` Suparna Bhattacharya
@ 2005-03-14 15:02         ` Werner Almesberger
  2005-03-14 15:43           ` Alex Tomas
  0 siblings, 1 reply; 24+ messages in thread
From: Werner Almesberger @ 2005-03-14 15:02 UTC (permalink / raw)
  To: Suparna Bhattacharya
  Cc: Mingming Cao, Badari Pulavarty, ext2-devel, linux-fsdevel,
	Alex Tomas, abiss-general

Suparna Bhattacharya wrote:
> I'm looking at whether we can do most of it at VFS level

Do you plan to reserve space as "blocks, somewhere", or as "these
specific on-disk locations" ? In ABISS, we did something of the
latter kind (in order to make large contiguous allocations also on
FAT), and it turned out to be a big mess, because ABISS needed too
much support from the file system driver. So we just scrapped that
bit :-)

> Of course, I haven't looked at how ABISS does delayed alloc -- 
> do you have a patch snippet I can look at ?

I just made a release. The kernel patch is in
abiss-7/kernel/abiss.patch. It's all in one big patch, sorry.
The main purpose of this is to see what we can achieve, so it's
not very polished.

The main parts: we added a new page flag, PG_delalloc, which
basically tells everyone to stay away from that page. There are
two purposes: (a) to make sure no allocation happens unless
explicitly requested, and (b) prevent the page from being written
back while it is still in ABISS' playout buffer. The reason for
(b) is that the page gets locked during writeback, which could
cause delays if the ABISS-using application then decides to
access the page.

The "hands off" code is mainly in fs/buffer.c, in the functions
__block_commit_write (set the page dirty, then go away),
cont_prepare_write (for FAT, do nothing),
block_prepare_write  (for ext2, do nothing),
and then fs/mpage.c:mpage_writepages (skip pages marked for
delayed allocation).
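
To illustrate the "hands off" rule, here is a minimal user-space sketch
in C. It is not the ABISS code: the struct, the flag values and the
function names are assumptions made up for the example, and only the
idea of a delalloc page flag comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for a user-space model (not kernel values). */
enum { MODEL_PG_DIRTY = 1 << 0, MODEL_PG_DELALLOC = 1 << 1 };

struct model_page {
	unsigned flags;
};

/*
 * Model of a writepages pass: write every dirty page except those
 * marked for delayed allocation, which stay dirty and untouched.
 * Returns the number of pages written.
 */
static size_t model_writepages(struct model_page *pages, size_t n)
{
	size_t written = 0;

	for (size_t i = 0; i < n; i++) {
		if (!(pages[i].flags & MODEL_PG_DIRTY))
			continue;
		if (pages[i].flags & MODEL_PG_DELALLOC)
			continue;	/* hands off: owner still holds the page */
		pages[i].flags &= ~(unsigned)MODEL_PG_DIRTY;
		written++;
	}
	return written;
}

/* The owner lets go of the page, so the next pass may write it. */
static void model_release(struct model_page *pg)
{
	pg->flags &= ~(unsigned)MODEL_PG_DELALLOC;
}
```

In this model a flagged page survives any number of writeback passes
and is written only after model_release() has been called on it.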

cont_prepare_write also needs to handle the special case where
it has to fill holes in a file. In this case, it simply overrides
delayed allocation. This bit will need more work.

Since ABISS prefetches pages, block_prepare_write and
cont_prepare_write may now see pages that are already up to date,
so they must not zero them.

The prefetching happens in fs/abiss/sched_lib.c:abiss_read_page,
and writeback in abiss_put_page. We also experimented with
leaving the writeback to MM, but that led to OOM far too often.
The current solution works quite smoothly even if we tax the
system hard.

In order to keep things simple, I didn't try to make delayed
allocation do anything for writers that don't use ABISS.

The life cycle of a page is about as follows: when an application
reads or writes a file, ABISS maintains a playout buffer for it,
that typically reaches a few hundred kB ahead of the current file
position. Pages are prefetched and locked in the playout buffer.
The playout buffer is dimensioned such that when file data enters the
playout buffer, there is enough time for the data to be in memory
by the time the application reaches it.

ABISS just calls readpage to get the data, which either causes it
to be read from disk, or the page to be zeroed, if we're beyond
EOF or at a hole.

The application accesses the page through the normal VFS functions,
so in the case of writing, the prepare/commit process happens.

Once the application has accessed the page, and moves the playout
buffer beyond it, the page is released and written back to disk.
Prefetching and writeback are done in a separate kernel thread, so
the application does not get delayed.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
/_http://www.almesberger.net/____________________________________________/


* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 15:02         ` Werner Almesberger
@ 2005-03-14 15:43           ` Alex Tomas
  2005-03-14 16:37             ` [Ext2-devel] " Werner Almesberger
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Tomas @ 2005-03-14 15:43 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Suparna Bhattacharya, Mingming Cao, Badari Pulavarty, ext2-devel,
	linux-fsdevel, Alex Tomas, abiss-general

>>>>> Werner Almesberger (WA) writes:

 WA> Do you plan to reserve space as "blocks, somewhere", or as "these
 WA> specific on-disk locations" ? In ABISS, we did something of the
 WA> latter kind (in order to make large contiguous allocations also on
 WA> FAT), and it turned out to be a big mess, because ABISS needed too
 WA> much support from the file system driver. So we just scrapped that
 WA> bit :-)

I see no reason to reserve a specific block in ->prepare/->commit in
delayed allocation case. We already do this with reservation.
The sole point of delayed allocation is to allocate many blocks at once:
to minimize fragmentation, to decrease allocator involvement, to avoid
allocation at all if the file gets truncated quickly.
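
A toy model in C of the two effects mentioned here (one allocator call
for many pages, and no allocation at all after an early truncate). This
is only a sketch under assumed names; nothing in it is taken from the
ext3 delalloc patch, and truncation of already-mapped blocks is not
modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory file used only for this illustration. */
struct model_file {
	size_t dirty_pages;	/* written by the application, no blocks yet */
	size_t mapped_pages;	/* pages that have disk blocks assigned */
	size_t allocator_calls;	/* how often the block allocator ran */
};

/* write(): with delayed allocation we only note that a page is dirty. */
static void model_write_page(struct model_file *f)
{
	f->dirty_pages++;
}

/* Truncate to zero before writeback: pending pages simply disappear. */
static void model_truncate(struct model_file *f)
{
	f->dirty_pages = 0;
}

/*
 * Writeback: a single allocator call covers every pending page.
 * Returns the number of pages that received blocks.
 */
static size_t model_writeback(struct model_file *f)
{
	size_t n = f->dirty_pages;

	if (n == 0)
		return 0;	/* nothing pending, allocator not involved */
	f->allocator_calls++;
	f->mapped_pages += n;
	f->dirty_pages = 0;
	return n;
}
```

Writing 64 pages and then calling model_writeback() costs one allocator
call in this model; writing and truncating before writeback costs none.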

 WA> The main parts: we added a new page flag, PG_delalloc, which
 WA> basically tells everyone to stay away from that page. There are
 WA> two purposes: (a) to make sure no allocation happens unless
 WA> explicitly requested, and (b) prevent the page from being written
 WA> back while it is still in ABISS' playout buffer. The reason for
 WA> (b) is that the page gets locked during writeback, which could
 WA> cause delays if the ABISS-using application then decides to
 WA> access the page.

locked during writeback? PG_writeback should be used instead of PG_locked.

thanks, Alex


* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 15:43           ` Alex Tomas
@ 2005-03-14 16:37             ` Werner Almesberger
  2005-03-14 17:13               ` Alex Tomas
  2005-03-14 22:23               ` Bryan Henderson
  0 siblings, 2 replies; 24+ messages in thread
From: Werner Almesberger @ 2005-03-14 16:37 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Suparna Bhattacharya, Mingming Cao, Badari Pulavarty, ext2-devel,
	linux-fsdevel, abiss-general

Alex Tomas wrote:
> I see no reason to reserve a specific block in ->prepare/->commit in
> delayed allocation case. We already do this with reservation.

This seems like a sensible approach to me. Trying to reserve specific
blocks in an FS-independent way was what got us in trouble on ABISS.
So plan B is to add this kind of reservation where it is really
lacking (i.e. FAT).

Hmm, it's a bit confusing that we call both things "reservation".
Well, airlines do this too, "free seating".

> locked during writeback? PG_writeback should be used instead of PG_locked.

In mpage_writepages, writepage can also get called with the page just
PG_locked.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
/_http://www.almesberger.net/____________________________________________/


* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 16:37             ` [Ext2-devel] " Werner Almesberger
@ 2005-03-14 17:13               ` Alex Tomas
  2005-03-15  0:28                 ` Werner Almesberger
  2005-03-14 22:23               ` Bryan Henderson
  1 sibling, 1 reply; 24+ messages in thread
From: Alex Tomas @ 2005-03-14 17:13 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: Alex Tomas, Suparna Bhattacharya, Mingming Cao, Badari Pulavarty,
	ext2-devel, linux-fsdevel, abiss-general

>>>>> Werner Almesberger (WA) writes:

 >> locked during writeback? PG_writeback should be used instead of PG_locked.

 WA> In mpage_writepages, writepage can also get called with the page just
 WA> PG_locked.

you can drop PG_locked right as you set PG_writeback, I think

thanks, Alex



* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 17:13               ` Alex Tomas
@ 2005-03-15  0:28                 ` Werner Almesberger
  0 siblings, 0 replies; 24+ messages in thread
From: Werner Almesberger @ 2005-03-15  0:28 UTC (permalink / raw)
  To: Alex Tomas
  Cc: Suparna Bhattacharya, Mingming Cao, Badari Pulavarty, ext2-devel,
	linux-fsdevel, abiss-general

Alex Tomas wrote:
> you can drop PG_locked right as you set PG_writeback, I think

Hmm, not sure. mpage_writepage never calls writepage with PG_writeback,
only with PG_locked. Also, mpage_writepage calls get_block with
PG_locked, so the allocation, which may take a while, holds the lock.
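
A simplified user-space model in C of the ordering under discussion:
lock the page, allocate if it has no block yet, then set writeback and
drop the lock before the I/O. The names and flag values are assumptions
for this sketch and it is not the fs/mpage.c code; it only shows why an
unmapped page keeps the lock held across the allocation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for the model. */
enum { MODEL_LOCKED = 1 << 0, MODEL_WRITEBACK = 1 << 1 };

struct model_wpage {
	unsigned flags;
	bool mapped;		/* page already has a disk block */
	int allocs_under_lock;	/* allocations done while locked */
};

/* Stand-in for get_block(): assigns a block, possibly slowly. */
static void model_get_block(struct model_wpage *pg)
{
	if (pg->flags & MODEL_LOCKED)
		pg->allocs_under_lock++;
	pg->mapped = true;
}

/* Model of one writepage call on a page that arrives unlocked. */
static void model_writepage(struct model_wpage *pg)
{
	pg->flags |= MODEL_LOCKED;
	if (!pg->mapped)
		model_get_block(pg);	/* lock is held across this */
	pg->flags |= MODEL_WRITEBACK;
	pg->flags &= ~(unsigned)MODEL_LOCKED;	/* handoff to writeback */
	/* the I/O would be submitted here, without the page lock */
}

/* I/O completion clears the writeback state. */
static void model_end_writeback(struct model_wpage *pg)
{
	pg->flags &= ~(unsigned)MODEL_WRITEBACK;
}
```

In the model, a page that was mapped before writepage runs never
allocates under the lock, so the locked window shrinks to the handoff.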

This situation is admittedly a bit annoying: on the one hand, "sync"
should write all dirty data. On the other hand, if a random user
typing "sync" can break performance guarantees, these guarantees
aren't very valuable.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
/_http://www.almesberger.net/____________________________________________/



* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 16:37             ` [Ext2-devel] " Werner Almesberger
  2005-03-14 17:13               ` Alex Tomas
@ 2005-03-14 22:23               ` Bryan Henderson
  2005-03-15  0:42                 ` Werner Almesberger
  1 sibling, 1 reply; 24+ messages in thread
From: Bryan Henderson @ 2005-03-14 22:23 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: abiss-general, Alex Tomas, cmm, ext2-devel, linux-fsdevel,
	pbadari, Suparna Bhattacharya

>Hmm, it's a bit confusing that we call both things "reservation".

I think "reservation" is wrong for one of them and anyone using it that 
way should stop.  I believe the common terminology is:

- choosing the blocks is "placement."

- committing the required number of blocks from the resource pool for the 
instant use is "reservation."

- the combination of reservation and placement is "allocation."

Obviously, traditional filesystem drivers haven't split placement from 
reservation, so don't bother to use those terms.

Most delaying schemes delay the placement but not the reservation because 
they don't want to accept the possibility that a write would fail for lack 
of space after the write() system call succeeded.

Even in non-filesystem areas, "allocate" usually means to assign 
particular resources, while "reserve" just means to make arrangements so 
that a future allocate will succeed.  For example, if you know you need up 
to 10 blocks of memory to complete a task without deadlocking, but you 
don't know yet exactly how many, you would reserve 10 blocks and 
later, if necessary, allocate the actual blocks.
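
A small C sketch of the three terms as defined above, using a
hypothetical fixed-size block pool. All names are invented for the
illustration: pool_reserve() only commits a count, pool_place() picks a
concrete block against an earlier reservation, and the two together
amount to an allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_BLOCKS 16

/* Hypothetical block pool, for illustration only. */
struct pool {
	bool in_use[POOL_BLOCKS];	/* placement state per block */
	size_t reserved;		/* promised but not yet placed */
	size_t placed;			/* blocks actually assigned */
};

/* Reservation: commit n blocks from the pool without choosing them. */
static int pool_reserve(struct pool *p, size_t n)
{
	if (p->reserved + p->placed + n > POOL_BLOCKS)
		return -1;	/* the caller learns about lack of space now */
	p->reserved += n;
	return 0;
}

/* Placement: choose a specific block against an existing reservation. */
static int pool_place(struct pool *p)
{
	if (p->reserved == 0)
		return -1;	/* nothing reserved */
	for (size_t i = 0; i < POOL_BLOCKS; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			p->reserved--;
			p->placed++;
			return (int)i;
		}
	}
	return -1;	/* not reached while reserved + placed <= POOL_BLOCKS */
}

/* Give back the part of a reservation that was never needed. */
static void pool_unreserve(struct pool *p, size_t n)
{
	assert(n <= p->reserved);
	p->reserved -= n;
}
```

For the 10-block example: pool_reserve(&p, 10) up front, pool_place()
for each block actually needed, then pool_unreserve() for the rest.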

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems


* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-14 22:23               ` Bryan Henderson
@ 2005-03-15  0:42                 ` Werner Almesberger
  2005-03-15 21:59                   ` Bryan Henderson
  0 siblings, 1 reply; 24+ messages in thread
From: Werner Almesberger @ 2005-03-15  0:42 UTC (permalink / raw)
  To: Bryan Henderson
  Cc: abiss-general, Alex Tomas, cmm, ext2-devel, linux-fsdevel,
	pbadari, Suparna Bhattacharya

Bryan Henderson wrote:
> I think "reservation" is wrong for one of them and anyone using it that 
> way should stop.

Hehe, start with ext3 :-)

> I believe the common terminology is:

Sounds reasonable. The thing with "reservation" is that people use
it in daily life with all kinds of meanings, and often with the
object of the reservation, e.g. "reserve a seat" (typically a
specific seat), "reserve some time" (often not a specific interval),
or "reserve a table" (at a restaurant, you don't know which one,
but the restaurant staff does).

To muddy the issue further, reservations can be more or less firm.
E.g. if we "reserve" the next hundred blocks, so that allocation is
contiguous, we may want to be able to take them away if some other
file needs them. On the other hand, if storage is already committed,
but just not on disk yet, that reservation shouldn't be revokable.

- Werner

-- 
  _________________________________________________________________________
 / Werner Almesberger, Buenos Aires, Argentina     werner@almesberger.net /
/_http://www.almesberger.net/____________________________________________/


* Re: Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-15  0:42                 ` Werner Almesberger
@ 2005-03-15 21:59                   ` Bryan Henderson
  0 siblings, 0 replies; 24+ messages in thread
From: Bryan Henderson @ 2005-03-15 21:59 UTC (permalink / raw)
  To: Werner Almesberger
  Cc: abiss-general, Alex Tomas, cmm, ext2-devel, linux-fsdevel,
	pbadari, Suparna Bhattacharya

>Sounds reasonable. The thing with "reservation" is that people use
>it in daily life with all kinds of meanings,

That's the way it is all over.  Normal people are very sloppy in their 
language.  Engineers have to try to narrow the meanings of the common 
words to avoid totally confusing each other in these complex discussions.

But I think "reserve" in common usage is a lot less ambiguous than you 
say.  I believe when you reserve a seat on an airplane, most of the time 
it isn't a particular seat.  When it is, the airline will call it a "seat 
assignment" and you get it only after you turn your reservation into a 
purchased ticket.

I've never worked in a restaurant, but I've always assumed that when I 
make a reservation, even the restaurant doesn't know which table it is 
until I show up.  That way, it can load balance and give people choices 
when they come in.

>E.g. if we "reserve" the next hundred blocks, so that allocation is
>contiguous, we may want to be able to take them away if some other
>file needs them.

I would not call that a reservation.  I did, incidentally, design such a 
system once, and I called it "pencilled in."  I might also call it 
preliminary placement.

But I agree that reservations can be more or less firm, owing to the fact 
that sometimes they can be broken, with more or less ease.  E.g. you might 
reserve a megabyte of space for a file, and under pathological conditions 
still be told when you go to write that there's no space for you and 
you're screwed.  Just like you can get to the restaurant and be told 
there's no table for you.


* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-04  1:12 ` [Ext2-devel] " Badari Pulavarty
  2005-03-04  1:46   ` Mingming Cao
@ 2005-03-04 11:30   ` Alex Tomas
  2005-03-04 15:02   ` Alex Tomas
  2 siblings, 0 replies; 24+ messages in thread
From: Alex Tomas @ 2005-03-04 11:30 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: suparna, ext2-devel, linux-fsdevel, alex

On 03 Mar 2005 17:12:14 -0800
Badari Pulavarty <pbadari@us.ibm.com> wrote:

> Just doing delayed allocation without multiblock allocation
> (with the current layout) is not really a useful thing, IMHO.
> We will benefit in a few cases, but in general - we moved the
> block allocation overhead from prepare write to writepages/writepage
> time. There is a little benefit of not doing journaling twice etc..
> but I don't think it would be enough to justify the effort. 
> Isn't it ?

One more benefit: if the file gets truncated soon, no allocation is needed at all.

> One more thing we need to keep in mind: we need to make sure
> that "ordered" mode is also improved, since all our test code
> focuses on "writeback" mode and the default mode is "ordered" :(

working on that.

thanks, Alex
 


* Re: [Ext2-devel] Reviewing ext3 improvement patches (delalloc, mballoc, extents)
  2005-03-04  1:12 ` [Ext2-devel] " Badari Pulavarty
  2005-03-04  1:46   ` Mingming Cao
  2005-03-04 11:30   ` [Ext2-devel] " Alex Tomas
@ 2005-03-04 15:02   ` Alex Tomas
  2005-03-13 14:41     ` Delayed alloc for ordered-mode Suparna Bhattacharya
  2 siblings, 1 reply; 24+ messages in thread
From: Alex Tomas @ 2005-03-04 15:02 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: suparna, sct, akpm, ext2-devel, linux-fsdevel

On 03 Mar 2005 17:12:14 -0800
Badari Pulavarty <pbadari@us.ibm.com> wrote:

> One more thing we need to keep in mind: we need to make sure
> that "ordered" mode is also improved, since all our test code
> focuses on "writeback" mode and the default mode is "ordered" :(
> 

I've just cooked the patch to implement ordered mode for delayed
allocation path. please take it:

ftp://ftp.clusterfs.com/pub/people/alex/2.6.11/ext3-delalloc-ordered-2.6.11-0.1.patch

Stephen, Andrew could you review it, please?

thanks, Alex


Index: linux-2.6.11/include/linux/jbd.h
===================================================================
--- linux-2.6.11.orig/include/linux/jbd.h	2005-03-02 20:49:13.000000000 +0300
+++ linux-2.6.11/include/linux/jbd.h	2005-03-04 17:03:52.000000000 +0300
@@ -486,6 +486,12 @@
 	struct journal_head	*t_sync_datalist;
 
 	/*
+	 * Number of BIOs submitted in the context of the transaction that
+	 * we want to complete before committing
+	 */
+	atomic_t		t_bios_in_flight;
+
+	/*
 	 * Doubly-linked circular list of all forget buffers (superseded
 	 * buffers which we can un-checkpoint once this transaction commits)
 	 * [j_list_lock]
@@ -678,6 +684,9 @@
 	/* Wait queue to wait for updates to complete */
 	wait_queue_head_t	j_wait_updates;
 
+	/* Wait queue to wait for all BIOs to complete */
+	wait_queue_head_t	j_wait_bios;
+
 	/* Semaphore for locking against concurrent checkpoints */
 	struct semaphore 	j_checkpoint_sem;
 
Index: linux-2.6.11/fs/jbd/commit.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/commit.c	2005-03-02 20:49:09.000000000 +0300
+++ linux-2.6.11/fs/jbd/commit.c	2005-03-04 17:53:52.000000000 +0300
@@ -619,6 +620,13 @@
 	if (is_journal_aborted(journal))
 		goto skip_commit;
 
+	/*
+	 * Before the commit record, we have to wait for all bios that
+	 * ext3_wb_writepages() issued against newly-allocated blocks
+	 */
+	wait_event(journal->j_wait_bios, 
+		atomic_read(&commit_transaction->t_bios_in_flight) == 0);
+
 	/* Done it all: now write the commit record.  We should have
 	 * cleaned up our previous buffers by now, so if we are in abort
 	 * mode we can now just skip the rest of the journal write
Index: linux-2.6.11/fs/jbd/transaction.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/transaction.c	2005-03-02 20:49:09.000000000 +0300
+++ linux-2.6.11/fs/jbd/transaction.c	2005-03-04 17:05:28.000000000 +0300
@@ -51,6 +51,7 @@
 	transaction->t_tid = journal->j_transaction_sequence++;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
+	atomic_set(&transaction->t_bios_in_flight, 0);
 
 	/* Set up the commit timer for the new transaction. */
 	journal->j_commit_timer->expires = transaction->t_expires;
Index: linux-2.6.11/fs/jbd/journal.c
===================================================================
--- linux-2.6.11.orig/fs/jbd/journal.c	2005-03-04 17:04:29.000000000 +0300
+++ linux-2.6.11/fs/jbd/journal.c	2005-03-04 17:04:40.000000000 +0300
@@ -671,6 +671,7 @@
 	init_waitqueue_head(&journal->j_wait_checkpoint);
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
+	init_waitqueue_head(&journal->j_wait_bios);
 	init_MUTEX(&journal->j_barrier);
 	init_MUTEX(&journal->j_checkpoint_sem);
 	spin_lock_init(&journal->j_revoke_lock);
Index: linux-2.6.11/fs/ext3/writeback.c
===================================================================
--- linux-2.6.11.orig/fs/ext3/writeback.c	2005-03-04 15:10:01.000000000 +0300
+++ linux-2.6.11/fs/ext3/writeback.c	2005-03-04 17:33:05.000000000 +0300
@@ -145,6 +145,17 @@
 	if (bio->bi_size)
 		return 1;
 
+	if (bio->bi_private) {
+		transaction_t *transaction = bio->bi_private;
+
+		/* 
+		 * journal_commit_transaction() may be awaiting
+		 * the bio to complete.
+		 */
+		if (atomic_dec_and_test(&transaction->t_bios_in_flight))
+			wake_up(&transaction->t_journal->j_wait_bios);
+	}
+
 	do {
 		struct page *page = bvec->bv_page;
 
@@ -162,6 +173,16 @@
 static struct bio *ext3_wb_bio_submit(struct bio *bio, handle_t *handle)
 {
 	bio->bi_end_io = ext3_wb_end_io;
+	if (handle) {
+		/*
+		 * In data=ordered we shouldn't commit the transaction
+		 * until all data related to the transaction get on a
+		 * platter.
+		 */
+		atomic_inc(&handle->h_transaction->t_bios_in_flight);
+		bio->bi_private = handle->h_transaction;
+	} else
+		bio->bi_private = NULL;
 	submit_bio(WRITE, bio);
 	return NULL;
 }
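
The counting scheme in the patch above can be modelled in user space
roughly as follows. This is a single-threaded sketch using C11 atomics,
with invented names; the real patch sleeps on j_wait_bios with
wait_event() and is woken from the bio completion handler, which the
model replaces with a simple wakeup counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the transaction fields the patch adds. */
struct model_txn {
	atomic_int bios_in_flight;	/* plays the role of t_bios_in_flight */
	int wakeups;			/* counts would-be wake_up() calls */
};

static void model_txn_init(struct model_txn *t)
{
	atomic_init(&t->bios_in_flight, 0);
	t->wakeups = 0;
}

/* Submission side: account for the bio before it is sent. */
static void model_submit(struct model_txn *t)
{
	atomic_fetch_add(&t->bios_in_flight, 1);
}

/*
 * Completion side: returns true if this was the last outstanding bio.
 * atomic_fetch_sub() returns the previous value, so 1 means the
 * counter has just reached zero.
 */
static bool model_end_io(struct model_txn *t)
{
	if (atomic_fetch_sub(&t->bios_in_flight, 1) == 1) {
		t->wakeups++;	/* the committer would be woken here */
		return true;
	}
	return false;
}

/* Commit side: the commit record may be written only at zero. */
static bool model_can_commit(struct model_txn *t)
{
	return atomic_load(&t->bios_in_flight) == 0;
}
```

Only the completion that brings the counter to zero triggers a wakeup,
which mirrors the atomic_dec_and_test() check in ext3_wb_end_io().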


* Delayed alloc for ordered-mode
  2005-03-04 15:02   ` Alex Tomas
@ 2005-03-13 14:41     ` Suparna Bhattacharya
  2005-03-13 19:32       ` Badari Pulavarty
  0 siblings, 1 reply; 24+ messages in thread
From: Suparna Bhattacharya @ 2005-03-13 14:41 UTC (permalink / raw)
  To: Alex Tomas; +Cc: Badari Pulavarty, sct, akpm, ext2-devel, linux-fsdevel


What would be really nice is if we could do this in a way that
enables reuse of generic paths even for ordered mode. One thought
that comes to mind is journal commit waiting for writeback to 
complete on the data pages which need to be flushed to disk before 
meta-data can be committed, much like we do for O_SYNC. 

I realise that JBD is intended to work at a level of abstraction
where it has no awareness of filesystems - hence the correspondence
with buffer heads all through. So would the above be a complete
no-no ?

Regards
Suparna

On Fri, Mar 04, 2005 at 06:02:35PM +0300, Alex Tomas wrote:
> On 03 Mar 2005 17:12:14 -0800
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> 
> > One more thing we need to keep in mind: we need to make sure
> > that "ordered" mode is also improved, since all our test code
> > focuses on "writeback" mode and the default mode is "ordered" :(
> > 
> 
> I've just cooked the patch to implement ordered mode for delayed
> allocation path. please take it:
> 
> ftp://ftp.clusterfs.com/pub/people/alex/2.6.11/ext3-delalloc-ordered-2.6.11-0.1.patch
> 
> Stephen, Andrew could you review it, please?
> 
> thanks, Alex
> 
> 

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India



* Re: Delayed alloc for ordered-mode
  2005-03-13 14:41     ` Delayed alloc for ordered-mode Suparna Bhattacharya
@ 2005-03-13 19:32       ` Badari Pulavarty
  0 siblings, 0 replies; 24+ messages in thread
From: Badari Pulavarty @ 2005-03-13 19:32 UTC (permalink / raw)
  To: suparna; +Cc: Alex Tomas, sct, akpm, linux-fsdevel

I think adding support to JBD to deal with "bio"s would be
a valuable generic extension. Anyway, it's dealing with "bh"s
now.

Thanks,
Badari

Suparna Bhattacharya wrote:

> What would be really nice is if we could do this in a way that
> enables reuse of generic paths even for ordered mode. One thought
> that comes to mind is journal commit waiting for writeback to 
> complete on the data pages which need to be flushed to disk before 
> meta-data can be committed, much like we do for O_SYNC. 
> 
> I realise that JBD is intended to work at a level of abstraction
> where it has no awareness of filesystems - hence the correspondence
> with buffer heads all through. So would the above be a complete
> no-no ?
> 
> Regards
> Suparna
> 
> On Fri, Mar 04, 2005 at 06:02:35PM +0300, Alex Tomas wrote:
> 
>>On 03 Mar 2005 17:12:14 -0800
>>Badari Pulavarty <pbadari@us.ibm.com> wrote:
>>
>>
>>>One more thing we need to keep in mind: we need to make sure
>>>that "ordered" mode is also improved, since all our test code
>>>focuses on "writeback" mode and the default mode is "ordered" :(
>>>
>>
>>I've just cooked the patch to implement ordered mode for delayed
>>allocation path. please take it:
>>
>>ftp://ftp.clusterfs.com/pub/people/alex/2.6.11/ext3-delalloc-ordered-2.6.11-0.1.patch
>>
>>Stephen, Andrew could you review it, please?
>>
>>thanks, Alex



end of thread (last message: 2005-03-15 21:59 UTC)

Thread overview: 24+ messages
2005-03-03  8:33 Reviewing ext3 improvement patches (delalloc, mballoc, extents) Suparna Bhattacharya
2005-03-03  9:40 ` Andreas Dilger
2005-03-03 22:10   ` Theodore Ts'o
2005-03-03 22:30     ` Alex Tomas
2005-03-04 11:13   ` Suparna Bhattacharya
2005-03-04 12:29     ` Alex Tomas
2005-03-04 18:25       ` [Ext2-devel] " Andreas Dilger
2005-03-04  1:12 ` [Ext2-devel] " Badari Pulavarty
2005-03-04  1:46   ` Mingming Cao
2005-03-04  3:26     ` Suparna Bhattacharya
2005-03-14  8:36     ` Werner Almesberger
2005-03-14  9:04       ` Suparna Bhattacharya
2005-03-14 15:02         ` Werner Almesberger
2005-03-14 15:43           ` Alex Tomas
2005-03-14 16:37             ` [Ext2-devel] " Werner Almesberger
2005-03-14 17:13               ` Alex Tomas
2005-03-15  0:28                 ` Werner Almesberger
2005-03-14 22:23               ` Bryan Henderson
2005-03-15  0:42                 ` Werner Almesberger
2005-03-15 21:59                   ` Bryan Henderson
2005-03-04 11:30   ` [Ext2-devel] " Alex Tomas
2005-03-04 15:02   ` Alex Tomas
2005-03-13 14:41     ` Delayed alloc for ordered-mode Suparna Bhattacharya
2005-03-13 19:32       ` Badari Pulavarty
