Lazy block allocation and block_prepare

linux-fsdevel.vger.kernel.org archive mirror

* Lazy block allocation and block_prepare_write?
@ 2005-04-18  0:54 Martin Jambor
  2005-04-19  3:01 ` Badari Pulavarty
  0 siblings, 1 reply; 17+ messages in thread
From: Martin Jambor @ 2005-04-18  0:54 UTC (permalink / raw)
  To: linux-fsdevel

Hi all,

I am a member of a group that implements a filesystem that allocates
disk blocks to in-memory blocks lazily, that means, the decision is
made just before the data are actually sent to disk. Moreover, when
cached pages are modified, the data can be (and almost certainly will
be) written to a different place from where it was read.

I was wondering whether we could use the generic function
block_prepare_write at all. The function checks every buffer of the
page and, if a buffer is not mapped, calls an fs-supplied function that is
supposed to map the buffer, i.e. assign it a block on the device and
set its mapped flag.
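
To make that loop concrete, here is a rough user-space sketch in C. It
is not the kernel code: toy_buffer, toy_fs, toy_prepare_write and
toy_get_block are all made-up stand-ins, and the real function does
considerably more (partial-page zeroing, reading in blocks that are only
partly overwritten, and so on). It only shows where the fs callback is
invoked and how its error would propagate.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define BUFFERS_PER_PAGE 4

/* Toy stand-in for struct buffer_head: only what the loop needs. */
struct toy_buffer {
    bool mapped;   /* models the "mapped" flag */
    long blocknr;  /* device block, meaningful only when mapped */
};

/* Hypothetical fs-supplied callback: maps one buffer or returns -errno. */
typedef int (*toy_get_block_t)(struct toy_buffer *bh, void *fs_private);

/* Walk the page's buffers and ask the fs to map the unmapped ones. */
static int toy_prepare_write(struct toy_buffer *bufs, size_t n,
                             toy_get_block_t get_block, void *fs_private)
{
    for (size_t i = 0; i < n; i++) {
        if (bufs[i].mapped)
            continue;
        int err = get_block(&bufs[i], fs_private);
        if (err)
            return err;          /* e.g. -ENOSPC from the filesystem */
        assert(bufs[i].mapped);  /* the callback must set the flag */
    }
    return 0;
}

/* Example callback: hands out consecutive blocks until the device is full. */
struct toy_fs {
    long next_free;
    long nr_blocks;
};

static int toy_get_block(struct toy_buffer *bh, void *fs_private)
{
    struct toy_fs *fs = fs_private;

    if (fs->next_free >= fs->nr_blocks)
        return -ENOSPC;
    bh->blocknr = fs->next_free++;
    bh->mapped = true;
    return 0;
}
```

In this model the callback is the only place that can say "no space",
which is why a lazily allocating filesystem would want to report the
error there even though it has no real block number to hand back.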

This is where we would like to return an error if there is not enough
free disk space left, but we cannot give a specific device block number
yet. Can we make one up, such as -1? What would that do to such dark
functions as unmap_underlying_metadata or any other? Would some other
part of the kernel break if a bunch of buffers were assigned to the
same spot on the disk?

On the other hand, if I understand buffer flags correctly, I need to
be able to emulate mapping of buffers to set them dirty, or am I
wrong?

Thanks for any insight or thoughts,

Martin Jambor


* Re: Lazy block allocation and block_prepare_write?
  2005-04-18  0:54 Lazy block allocation and block_prepare_write? Martin Jambor
@ 2005-04-19  3:01 ` Badari Pulavarty
  2005-04-19 10:10   ` Alex Tomas
                     ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Badari Pulavarty @ 2005-04-19  3:01 UTC (permalink / raw)
  To: Martin Jambor; +Cc: linux-fsdevel

Martin Jambor wrote:

> Hi all,
> 
> I am a member of a group that implements a filesystem that allocates
> disk blocks to in-memory blocks lazily, that means, the decision is
> made just before the data are actually sent to disk. Moreover, when
> cached pages are modified, the data can be (and almost certainly will
> be) written to a different place from where it was read.
> 
> I was wondering, whether we could use the generic function
> block_prepare_write at all. The function checks every buffer of the
> page and if it is not mapped, it calls a fs supplied function that is
> supposed to map the buffer, i.e. assign it a block on the device and
> set its mapped flag.
> 
> This is where we would like to give an error if there is not enough
> free disk space left but we cannot give a specific device block number
> yet. Can we make one up, such as -1? What would that do to such dark
> functions as unmap_underlying_metadata or any other? Would some other
> part of kernel break if there was a bunch of buffers assigned to the
> same spot on the disk?
> 
> On the other hand, if I understand buffer flags correctly, I need to
> be able to emulate mapping of buffers to set them dirty, or am I
> wrong?
> 
> Thanks for any insight or thoughts,

Yes, it's possible to do what you want. I am currently working on
adding "delayed allocation" support to ext3. As part of that, we
are modifying the generic helper routines to delay the allocation from
prepare time to actual writeout time (writepage).

Here is the basic idea:
=======================

The idea is to "reserve" a block at prepare/commit write time instead
of allocating it, and to do the actual allocation in writepage().
Sounds simple :)
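
A minimal sketch of that accounting, in user-space C, might look like
the following. The struct and function names (toy_space,
toy_reserve_block, toy_allocate_reserved, toy_release_reservation) are
hypothetical and the allocator is a bare counter; the point is only
that ENOSPC is decided at reservation time while the block number is
chosen later.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-filesystem space accounting. */
struct toy_space {
    long free_blocks;     /* blocks not yet allocated on disk */
    long reserved_blocks; /* promised to dirty pages, not yet allocated */
    long next_block;      /* trivial allocator cursor */
};

/* prepare/commit time: promise a block, fail early when space is gone. */
static int toy_reserve_block(struct toy_space *sp)
{
    if (sp->free_blocks - sp->reserved_blocks <= 0)
        return -ENOSPC;
    sp->reserved_blocks++;
    return 0;
}

/* writepage time: turn one reservation into a real block number. */
static long toy_allocate_reserved(struct toy_space *sp)
{
    assert(sp->reserved_blocks > 0 && sp->free_blocks > 0);
    sp->reserved_blocks--;
    sp->free_blocks--;
    return sp->next_block++;
}

/* Page truncated or invalidated before writeout: give the promise back. */
static void toy_release_reservation(struct toy_space *sp)
{
    assert(sp->reserved_blocks > 0);
    sp->reserved_blocks--;
}
```

Because every reservation is backed by a free block, the allocation
step in this model cannot run out of space; whether that holds in a
real filesystem depends on metadata overhead, which is ignored here.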

Here are the issues:
====================

1) Currently none of the generic helper routines can handle this.
We need to add support for it, but still somehow keep the
routines generic enough for everyone's use.

2) There is no easy way to find out in writepage() whether we
"reserved" a block or not. There are 2 paths to writepage():

	sys_write() -> prepare/commit()
		and later sync() ----> writepage()

	mmap() -> touch a page()
		and later --> writepage()

In order to do the accounting correctly, we need to mark a page
to indicate whether we reserved a block or not. One way to do this
is to use page->private. But then all the generic
routines will fail, since they assume that page->private represents
bufferheads. So we need a better way to do this.

3) We need to add hooks into filesystem-specific calls from these
generic routines to handle "journaling mode" requirements
(for ext3 and maybe others).

So, what are your requirements? I am looking for a common
way to combine all the requirements and come up with
saner "generic" routines to handle these.

Thanks,
Badari


* Re: Lazy block allocation and block_prepare_write?
  2005-04-19  3:01 ` Badari Pulavarty
@ 2005-04-19 10:10   ` Alex Tomas
  2005-04-19 14:48     ` Badari Pulavarty
  2005-04-19 11:22   ` Nikita Danilov
  2005-04-19 20:41   ` Martin Jambor
  2 siblings, 1 reply; 17+ messages in thread
From: Alex Tomas @ 2005-04-19 10:10 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Martin Jambor, linux-fsdevel

>>>>> Badari Pulavarty (BP) writes:

 BP> In order to do the correct accounting, we need to mark a page
 BP> to indicate if we reserved a block or not. One way to do this,
 BP> to use page->private to indicate this. But then, all the generic
 BP> routines will fail - since they assume that page->private represents
 BP> bufferheads. So we need a better way to do this.

you can introduce one more bit to page->flags

 BP> 3) We need add hooks into filesystem specific calls from these
 BP> generic routines to handle "journaling mode" requirements
 BP> (for ext3 and may be others).

nobody uses journaling mode except ext3


* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 10:10   ` Alex Tomas
@ 2005-04-19 14:48     ` Badari Pulavarty
  2005-04-19 15:04       ` Alex Tomas
  0 siblings, 1 reply; 17+ messages in thread
From: Badari Pulavarty @ 2005-04-19 14:48 UTC (permalink / raw)
  To: Alex Tomas; +Cc: Martin Jambor, linux-fsdevel

On Tue, 2005-04-19 at 03:10, Alex Tomas wrote:
> >>>>> Badari Pulavarty (BP) writes:
> 
>  BP> In order to do the correct accounting, we need to mark a page
>  BP> to indicate if we reserved a block or not. One way to do this,
>  BP> to use page->private to indicate this. But then, all the generic
>  BP> routines will fail - since they assume that page->private represents
>  BP> bufferheads. So we need a better way to do this.
> 
> you can introduce one more bit to page->flags

Agreed. I was hoping to avoid it as much as I can.

> 
>  BP> 3) We need add hooks into filesystem specific calls from these
>  BP> generic routines to handle "journaling mode" requirements
>  BP> (for ext3 and may be others).
> 
> nobody uses journaling mode except ext3

What I meant by journaling mode is that, after the pages are submitted
for IO, we need some way of waiting for the IOs to finish in order to
guarantee the ordering. Is this not needed for anything other than
ext3?

Thanks,
Badari



* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 14:48     ` Badari Pulavarty
@ 2005-04-19 15:04       ` Alex Tomas
  2005-04-19 15:00         ` Badari Pulavarty
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Tomas @ 2005-04-19 15:04 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Alex Tomas, Martin Jambor, linux-fsdevel

>>>>> Badari Pulavarty (BP) writes:

 >> you can introduce one more bit to page->flags

 BP> Agreed. I was hoping to avoid it as much as I can.

well, you're gonna modify mpage api anyway ...

 BP> What I meant by jounalling mode is that - after the pages are submitted
 BP> for IO, we need some way of waiting for the IOs to finish inorder to
 BP> guarantee the ordering ? Is this not needed for anything other than
 BP> ext3 ?

1) i'm not sure anyone else supports this
2) Andrew proposed an excellent solution

thanks, Alex


* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 15:04       ` Alex Tomas
@ 2005-04-19 15:00         ` Badari Pulavarty
  2005-04-19 15:20           ` Alex Tomas
  0 siblings, 1 reply; 17+ messages in thread
From: Badari Pulavarty @ 2005-04-19 15:00 UTC (permalink / raw)
  To: Alex Tomas; +Cc: Martin Jambor, linux-fsdevel

On Tue, 2005-04-19 at 08:04, Alex Tomas wrote:
> >>>>> Badari Pulavarty (BP) writes:
>  
>  >> you can introduce one more bit to page->flags
> 
>  BP> Agreed. I was hoping to avoid it as much as I can.
> 
> well, you're gonna modify mpage api anyway ...

Okay, I will give it a serious look then. Last time I tried to
go near page->flags, I got slapped :( This time we have a
valid reason, I guess :)

> 
>  BP> What I meant by jounalling mode is that - after the pages are submitted
>  BP> for IO, we need some way of waiting for the IOs to finish inorder to
>  BP> guarantee the ordering ? Is this not needed for anything other than
>  BP> ext3 ?
> 
> 1) i'm not sure anyone else supports this

Fair enough. If no one needs it, let's keep the interface simple.

> 2) Andrew proposed the excelent solution

Well, I wasn't sure how heavy that's going to be. He was recommending
that we flush all dirty pages from all inodes for each transaction
commit, wasn't he?

Thanks,
Badari



* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 15:00         ` Badari Pulavarty
@ 2005-04-19 15:20           ` Alex Tomas
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Tomas @ 2005-04-19 15:20 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Alex Tomas, Martin Jambor, linux-fsdevel

>>>>> Badari Pulavarty (BP) writes:

 >> 2) Andrew proposed the excelent solution

 BP> Well, I wasn't sure how heavy thats going to be. He was recommending
 BP> that we flush all dirty pages from all inodes for each transaction
 BP> commit. Isn't it ?

this is exactly what ext3 does when mounted with data=ordered:
each page that write(2) touches goes onto a jbd list, and the commit
thread flushes them all. the only reason we can't use the existing sync()
infrastructure is that we aren't permitted to touch metadata (in
our case, to allocate blocks) during commit. so one more
flag is added to wbc to tell sync to skip not-yet-allocated pages.
I like this a lot!
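
A user-space sketch of how such a flag could behave is below. It is an
illustration only: toy_wbc stands in for struct writeback_control, and
the field name skip_unallocated is made up for this example, not taken
from any actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    bool dirty;
    bool allocated; /* does the page already have a disk block? */
};

/* Stand-in for struct writeback_control with one hypothetical flag. */
struct toy_wbc {
    bool skip_unallocated;
};

/*
 * Flush dirty pages and return how many were written. When the caller
 * may not touch metadata (the commit path), pages that would still need
 * a block allocation are left dirty for a later, ordinary writeback.
 */
static size_t toy_flush(struct toy_page *pages, size_t n,
                        const struct toy_wbc *wbc)
{
    size_t written = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!pages[i].dirty)
            continue;
        if (!pages[i].allocated) {
            if (wbc->skip_unallocated)
                continue;              /* commit path: leave for later */
            pages[i].allocated = true; /* normal path: allocate now */
        }
        pages[i].dirty = false;
        written++;
    }
    return written;
}
```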

thanks, Alex


* Re: Lazy block allocation and block_prepare_write?
  2005-04-19  3:01 ` Badari Pulavarty
  2005-04-19 10:10   ` Alex Tomas
@ 2005-04-19 11:22   ` Nikita Danilov
  2005-04-19 14:46     ` Badari Pulavarty
  2005-04-20  0:00     ` Bryan Henderson
  2005-04-19 20:41   ` Martin Jambor
  2 siblings, 2 replies; 17+ messages in thread
From: Nikita Danilov @ 2005-04-19 11:22 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-fsdevel

Badari Pulavarty <pbadari@us.ibm.com> writes:

[...]

>
> Yes. Its possible to do what you want to. I am currently working on
> adding "delayed allocation" support to ext3. As part of that, We

As you most likely already know, Alex Tomas has already implemented delayed
block allocation for ext3.

[...]

>
> In order to do the correct accounting, we need to mark a page
> to indicate if we reserved a block or not. One way to do this,
> to use page->private to indicate this. But then, all the generic

I believe one can use the PG_mappedtodisk bit in page->flags for this
purpose. There was an old patch by Andrew Morton that introduced a new bit
(PG_delalloc?) for this purpose.

> routines will fail - since they assume that page->private represents
> bufferheads. So we need a better way to do this.

They are not generic then. Some file systems store things completely
different from a buffer head ring in page->private.

>
> 3) We need add hooks into filesystem specific calls from these
> generic routines to handle "journaling mode" requirements
> (for ext3 and may be others).

Please don't. There is no such thing as "generic
journalling". The traditional WAL used by ext3, the phase-trees of Tux2, and
the wandering logs of reiser4 are so different that there is no hope
for a single API to accommodate them all. Adding such an API will only
force more workarounds and hacks in non-ext3 file systems.

What _is_ common to all journalling file systems, on the other hand, is
the notion of a transaction as the natural unit of caching and
write-out. Currently in Linux, write-out is inode-based
(->writepages()). Reiser4 already has a patch that replaces the
sync_sb_inodes() function with a super-block operation. In the reiser4 case,
this operation scans the list of transactions (instead of the list of
inodes) and writes some of them out, which is the natural thing to do for a
journalled file system.

Similarly, a transaction is a unit of caching: it's often necessary to
scan all pages of a given transaction, all dirty pages of a given
transaction, or to check whether a given page belongs to a given
transaction. That is, a transaction plays a role similar to struct
address_space. But currently there is a 1-to-1 relation between inodes and
address_spaces, and this forces a file system to implement additional data
structures that duplicate functionality already present in address_space.

>
> So, what are your requirements ?  I am looking for a common
> way to combine all the requirements and come out with a
> saner "generic" routines to handle these.
>

I think that one reasonable way to add generic support for journalling
is to split struct address_space into two objects: a lower layer that
represents a "file" (say, struct vm_file), in which pages are linearly
ordered, and, on top of this, a vm_cache (representing a transaction) that
keeps track of pages from various vm_files. The vm_file is embedded into
the inode, and the vm_cache has a pointer to (the analog of) struct
address_space_operations.

vm_cache's are created by file system back-end as necessary (can be
embedded into inode for non-journalled file systems). VM scanner and
balance_dirty_pages() call vm_cache operations to do write-out.
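
To illustrate the shape of that split, here is a rough user-space
sketch in C. Only the names vm_file and vm_cache come from the proposal
above; every field, the toy_page type, vm_cache_ops, vm_cache_track()
and toy_writeout() are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>

struct vm_cache;

/* Analog of address_space_operations, attached to the cache. */
struct vm_cache_ops {
    /* Write out up to nr_to_write dirty pages; returns pages written. */
    size_t (*writeout)(struct vm_cache *cache, size_t nr_to_write);
};

struct toy_page {
    size_t index;                   /* linear position within its vm_file */
    int dirty;
    struct vm_cache *cache;         /* transaction tracking this page */
    struct toy_page *next_in_cache; /* link in the cache's list */
};

/* Lower layer: the "file", pages in linear order; lives in the inode. */
struct vm_file {
    struct toy_page *pages;
    size_t nr_pages;
};

/* Upper layer: tracks pages from many vm_files, e.g. one transaction. */
struct vm_cache {
    const struct vm_cache_ops *ops;
    struct toy_page *tracked;       /* singly linked list of pages */
    size_t nr_dirty;
};

/* Attach a page to a cache and mark it dirty. */
static void vm_cache_track(struct vm_cache *cache, struct toy_page *pg)
{
    assert(pg->cache == NULL);
    pg->cache = cache;
    pg->dirty = 1;
    pg->next_in_cache = cache->tracked;
    cache->tracked = pg;
    cache->nr_dirty++;
}

/* Example writeout: cleans pages regardless of which file owns them. */
static size_t toy_writeout(struct vm_cache *cache, size_t nr_to_write)
{
    size_t done = 0;

    for (struct toy_page *pg = cache->tracked; pg && done < nr_to_write;
         pg = pg->next_in_cache) {
        if (!pg->dirty)
            continue;
        pg->dirty = 0;
        cache->nr_dirty--;
        done++;
    }
    return done;
}
```

A non-journalled filesystem would, in this picture, simply have one
vm_cache per inode, so the existing per-inode behaviour falls out as a
special case.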

>
> Thanks,
> Badari

Nikita.


* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 11:22   ` Nikita Danilov
@ 2005-04-19 14:46     ` Badari Pulavarty
  2005-04-19 15:55       ` Nikita Danilov
  2005-04-20  0:00     ` Bryan Henderson
  1 sibling, 1 reply; 17+ messages in thread
From: Badari Pulavarty @ 2005-04-19 14:46 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: linux-fsdevel

On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
> Badari Pulavarty <pbadari@us.ibm.com> writes:
> 
> [...]
> 
> >
> > Yes. Its possible to do what you want to. I am currently working on
> > adding "delayed allocation" support to ext3. As part of that, We
> 
> As you most likely already know, Alex Thomas already implemented delayed
> block allocation for ext3.

Yep. I reviewed Alex Tomas's patches for delayed allocation. He handled
all the cases in his code and did NOT use any mpage* routines to do
the work. I was hoping to change the mpage infrastructure to handle
these, so that every filesystem doesn't have to do its own thing.


> 
> >
> > In order to do the correct accounting, we need to mark a page
> > to indicate if we reserved a block or not. One way to do this,
> > to use page->private to indicate this. But then, all the generic
> 
> I believe one can use PG_mappedtodisk bit in page->flags for this
> purpose. There was old Andrew Morton's patch that introduced new bit
> (PG_delalloc?) for this purpose.

That would be good. But I don't feel like asking for a bit in page
if there is a way to get around it.

> 
> > routines will fail - since they assume that page->private represents
> > bufferheads. So we need a better way to do this.
> 
> They are not generic then. Some file systems store things completely
> different from buffer head ring in page->private.

Yep. Instead of changing the whole world, I was hoping to come up with
a few common interfaces (which don't assume anything about bufferheads
etc.) that are useful for more than one filesystem.


> >
> > 3) We need add hooks into filesystem specific calls from these
> > generic routines to handle "journaling mode" requirements
> > (for ext3 and may be others).
> 
> Please don't. There is no such thing as "generic
> journalling". Traditional WAL used by ext3, phase-trees of Tux2, and
> wandering logs of reiser4 are so much different that there is no hope
> for a single API to accommodate them all. Adding such API will only
> force more workarounds and hacks in non-ext3 file systems.
> 
> What _is_ common to all journalling file systems on the other hand, is
> the notion of transaction as the natural unit of caching and
> write-out. Currently in Linux, write-out is inode-based
> (->writepages()). Reiser4 already has a patch that replaces
> sync_sb_inodes() function with super-block operation. In reiser4 case,
> this operation scans the list of transactions (instead of the list of
> inodes) and writes some of them out, which is natural thing to do for a
> journalled file system.
> 
> Similarly, transaction is a unit of caching: it's often necessary to
> scan all pages of a given transaction, all dirty pages of a given
> transaction, or to check whether given page belongs to a given
> transaction. That is, transaction plays role similar to struct
> address_space. But currently there is 1-to-1 relation between inodes and
> address_spaces, and this forces file system to implement additional data
> structures to duplicate functionality already present in address_space.
> >
> > So, what are your requirements ?  I am looking for a common
> > way to combine all the requirements and come out with a
> > saner "generic" routines to handle these.
> >
> 
> I think that one reasonable way to add generic support for journalling
> is to split struct address_space into two objects: lower layer that
> represents "file" (say, struct vm_file), in which pages are linearly
> ordered, and on top of this vm_cache (representing transaction) that
> keeps track of pages from various vm_file's. vm_file is embedded into
> inode, and vm_cache has a pointer to (the analog of) struct
> address_space_operations.
> 
> vm_cache's are created by file system back-end as necessary (can be
> embedded into inode for non-journalled file systems). VM scanner and
> balance_dirty_pages() call vm_cache operations to do write-out.

Need to think some more. I guess you have thought about this more than I
have :)

Thanks,
Badari



* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 14:46     ` Badari Pulavarty
@ 2005-04-19 15:55       ` Nikita Danilov
  2005-04-19 16:06         ` Alex Tomas
  2005-04-19 17:08         ` Mingming Cao
  0 siblings, 2 replies; 17+ messages in thread
From: Nikita Danilov @ 2005-04-19 15:55 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: fsdevel

Badari Pulavarty <pbadari@us.ibm.com> writes:

> On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
>> Badari Pulavarty <pbadari@us.ibm.com> writes:
>> 
>> [...]
>> 
>> >
>> > Yes. Its possible to do what you want to. I am currently working on
>> > adding "delayed allocation" support to ext3. As part of that, We
>> 
>> As you most likely already know, Alex Thomas already implemented delayed
>> block allocation for ext3.
>
> Yep. I reviewed Alex Thomas patches for delayed allocation. He handled
> all the cases in his code and did NOT use any mpage* routines to do
> the work. I was hoping to change the mpage infrastructure to handle
> these, so that every filesystem doesn't have to do their thing.
>

Just keep in mind that filesystem != ext3. :-) Generic support makes
sense only when it is usable by multiple file systems. This is not
always possible, e.g., there is no "generic block allocator" for
precisely the same reason: disk space allocation policies are tightly
intertwined with the rest of file system internals.

>
>> 
>> >
>> > In order to do the correct accounting, we need to mark a page
>> > to indicate if we reserved a block or not. One way to do this,
>> > to use page->private to indicate this. But then, all the generic
>> 
>> I believe one can use PG_mappedtodisk bit in page->flags for this
>> purpose. There was old Andrew Morton's patch that introduced new bit
>> (PG_delalloc?) for this purpose.
>
> That would be good. But I don't feel like asking for a bit in page
> if there is a way to get around it.

Clarification: PG_mappedtodisk is already there; it seems you can reuse
this existing bit to implement delayed allocation support.

>

[...]

>> >
> Need to think some more. I guess you thought about this more than you
> do :)
>
> Thanks,
> Badari
>

Nikita.


* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 15:55       ` Nikita Danilov
@ 2005-04-19 16:06         ` Alex Tomas
  2005-04-19 16:59           ` Badari Pulavarty
  2005-04-19 17:08         ` Mingming Cao
  1 sibling, 1 reply; 17+ messages in thread
From: Alex Tomas @ 2005-04-19 16:06 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Badari Pulavarty, fsdevel

>>>>> Nikita Danilov (ND) writes:

 >>> > In order to do the correct accounting, we need to mark a page
 >>> > to indicate if we reserved a block or not. One way to do this,
 >>> > to use page->private to indicate this. But then, all the generic
 >>> 
 >>> I believe one can use PG_mappedtodisk bit in page->flags for this
 >>> purpose. There was old Andrew Morton's patch that introduced new bit
 >>> (PG_delalloc?) for this purpose.
 >> 
 >> That would be good. But I don't feel like asking for a bit in page
 >> if there is a way to get around it.

 ND> Clarification: PG_mappedtodisk is already here, it seems you can reuse
 ND> this already existing bit to implement delayed allocation support.

I think we need another one, because mappedtodisk != reserved. we could use
mappedtodisk, but this means in ->commit_write() we'd need to check that one
more time (first time in ->prepare_write())
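
The distinction can be sketched in user-space C as two separate bits.
TOY_PG_RESERVED is a hypothetical flag invented for this example and
the toy_* names are not kernel APIs; the sketch also shows why
writepage() has to cope with pages dirtied through mmap() that never
went through prepare_write().

```c
#include <assert.h>
#include <errno.h>

enum {
    TOY_PG_MAPPEDTODISK = 1 << 0, /* page already has on-disk blocks */
    TOY_PG_RESERVED     = 1 << 1  /* hypothetical: promised, not allocated */
};

struct toy_page {
    unsigned int flags;
    long blocknr;
};

struct toy_fs {
    long free_blocks;
    long reserved;
    long next_block;
};

/* write(2) path: called at prepare time. */
static int toy_prepare(struct toy_fs *fs, struct toy_page *pg)
{
    if (pg->flags & (TOY_PG_MAPPEDTODISK | TOY_PG_RESERVED))
        return 0;                   /* nothing new to account for */
    if (fs->free_blocks - fs->reserved <= 0)
        return -ENOSPC;
    fs->reserved++;
    pg->flags |= TOY_PG_RESERVED;
    return 0;
}

/* Both paths end here; an mmap-dirtied page may carry no reservation. */
static int toy_writepage(struct toy_fs *fs, struct toy_page *pg)
{
    if (pg->flags & TOY_PG_MAPPEDTODISK)
        return 0;                   /* overwrite in place */
    if (pg->flags & TOY_PG_RESERVED) {
        fs->reserved--;             /* consume the promise */
        pg->flags &= ~(unsigned int)TOY_PG_RESERVED;
    } else if (fs->free_blocks - fs->reserved <= 0) {
        return -ENOSPC;             /* unreserved page and no space left */
    }
    assert(fs->free_blocks > 0);
    fs->free_blocks--;
    pg->blocknr = fs->next_block++;
    pg->flags |= TOY_PG_MAPPEDTODISK;
    return 0;
}
```

With a single bit, the "has a disk block" and "has a promise" states
could not be told apart in toy_writepage(), which is the point being
made above.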

thanks, Alex



* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 16:06         ` Alex Tomas
@ 2005-04-19 16:59           ` Badari Pulavarty
  0 siblings, 0 replies; 17+ messages in thread
From: Badari Pulavarty @ 2005-04-19 16:59 UTC (permalink / raw)
  To: Alex Tomas; +Cc: Nikita Danilov, fsdevel

Alex Tomas wrote:
>>>>>>Nikita Danilov (ND) writes:
> 
> 
>  >>> > In order to do the correct accounting, we need to mark a page
>  >>> > to indicate if we reserved a block or not. One way to do this,
>  >>> > to use page->private to indicate this. But then, all the generic
>  >>> 
>  >>> I believe one can use PG_mappedtodisk bit in page->flags for this
>  >>> purpose. There was old Andrew Morton's patch that introduced new bit
>  >>> (PG_delalloc?) for this purpose.
>  >> 
>  >> That would be good. But I don't feel like asking for a bit in page
>  >> if there is a way to get around it.
> 
>  ND> Clarification: PG_mappedtodisk is already here, it seems you can reuse
>  ND> this already existing bit to implement delayed allocation support.
> 
> I think we need another one, because mappedtodisk != reserved. we could use
> mappedtodisk, but this means in ->commit_write() we'd need to check that one
> more time (first time in ->prepare_write())

Yep. We need one more to indicate that we reserved a block for this page.

Another option I was thinking of, to avoid that, is to "reserve" a block
when a mapped page changes from read -> write. Andrew's -mm tree has
a patch that gives us a notification when that happens.

Thanks,
Badari



* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 15:55       ` Nikita Danilov
  2005-04-19 16:06         ` Alex Tomas
@ 2005-04-19 17:08         ` Mingming Cao
  2005-04-19 18:45           ` Nikita Danilov
  1 sibling, 1 reply; 17+ messages in thread
From: Mingming Cao @ 2005-04-19 17:08 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Badari Pulavarty, fsdevel

On Tue, 2005-04-19 at 19:55 +0400, Nikita Danilov wrote:
> Badari Pulavarty <pbadari@us.ibm.com> writes:
> 
> > On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
> >> Badari Pulavarty <pbadari@us.ibm.com> writes:
> >> 
> >> [...]
> >> 
> >> >
> >> > Yes. Its possible to do what you want to. I am currently working on
> >> > adding "delayed allocation" support to ext3. As part of that, We
> >> 
> >> As you most likely already know, Alex Thomas already implemented delayed
> >> block allocation for ext3.
> >
> > Yep. I reviewed Alex Thomas patches for delayed allocation. He handled
> > all the cases in his code and did NOT use any mpage* routines to do
> > the work. I was hoping to change the mpage infrastructure to handle
> > these, so that every filesystem doesn't have to do their thing.
> >
> 
> Just keep in mind that filesystem != ext3. :-) Generic support makes
> sense only when it is usable by multiple file systems. This is not
> always possible, e.g., there is no "generic block allocator" for
> precisely the same reason: disk space allocation policies are tightly
> intertwined with the rest of file system internals.
> 

This generic support should be useful for ext2 and xfs. From the delayed
allocation point of view, it need not be aware of any filesystem-specific
block allocation policies, and it should not care. :) It simply
gathers all the pages that need to be mapped to blocks on disk and calls the
filesystem's get_blocks() callback function, which takes care of the
filesystem-specific multiple-block mapping.

The current get_blocks() function for ext3 simply calls
ext3_get_block() in a loop. I am trying to add a real ext3_get_blocks() to reduce
the CPU cost, reduce the number of metadata updates, and increase the
chance of getting contiguous blocks on disk.
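
The difference between the two forms can be sketched in user-space C as
follows. This is not ext3 code: toy_alloc is a made-up bump allocator
and all three functions are hypothetical; the calls counter is there
only to make the cost difference visible.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_alloc {
    long next_free;
    long nr_blocks;
    unsigned long calls; /* how often the allocator was entered */
};

/* Single-block mapping, standing in for a per-block get_block(). */
static int toy_get_block(struct toy_alloc *al, long *out)
{
    al->calls++;
    if (al->next_free >= al->nr_blocks)
        return -ENOSPC;
    *out = al->next_free++;
    return 0;
}

/* The "loop" form: one allocator call per block. */
static int toy_get_blocks_loop(struct toy_alloc *al, size_t want, long *out)
{
    for (size_t i = 0; i < want; i++) {
        int err = toy_get_block(al, &out[i]);
        if (err)
            return err;
    }
    return 0;
}

/*
 * The "real" form: one call maps a contiguous run. Returns the number
 * of blocks mapped (possibly fewer than wanted) or -ENOSPC when nothing
 * is left; *first receives the start of the run.
 */
static long toy_get_blocks_extent(struct toy_alloc *al, size_t want,
                                  long *first)
{
    long avail, got;

    assert(want > 0);
    al->calls++;
    avail = al->nr_blocks - al->next_free;
    if (avail <= 0)
        return -ENOSPC;
    got = (long)want < avail ? (long)want : avail;
    *first = al->next_free;
    al->next_free += got;
    return got;
}
```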




* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 17:08         ` Mingming Cao
@ 2005-04-19 18:45           ` Nikita Danilov
  0 siblings, 0 replies; 17+ messages in thread
From: Nikita Danilov @ 2005-04-19 18:45 UTC (permalink / raw)
  To: fsdevel; +Cc: Mingming Cao

Mingming Cao <cmm@us.ibm.com> writes:

> On Tue, 2005-04-19 at 19:55 +0400, Nikita Danilov wrote:
>> Badari Pulavarty <pbadari@us.ibm.com> writes:
>> 
>> > On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
>> >> Badari Pulavarty <pbadari@us.ibm.com> writes:
>> >> 
>> >> [...]
>> >> 
>> >> >
>> >> > Yes. Its possible to do what you want to. I am currently working on
>> >> > adding "delayed allocation" support to ext3. As part of that, We
>> >> 
>> >> As you most likely already know, Alex Thomas already implemented delayed
>> >> block allocation for ext3.
>> >
>> > Yep. I reviewed Alex Thomas patches for delayed allocation. He handled
>> > all the cases in his code and did NOT use any mpage* routines to do
>> > the work. I was hoping to change the mpage infrastructure to handle
>> > these, so that every filesystem doesn't have to do their thing.
>> >
>> 
>> Just keep in mind that filesystem != ext3. :-) Generic support makes
>> sense only when it is usable by multiple file systems. This is not
>> always possible, e.g., there is no "generic block allocator" for
>> precisely the same reason: disk space allocation policies are tightly
>> intertwined with the rest of file system internals.
>> 
>
> This generic support should be useful for ext2 and xfs. From delayed

But it won't work for reiser4, which allocates blocks _across_ multiple
files. E.g., if many files were created in the same directory,
allocation (performed just before write-out) will assign block numbers
so that the files are laid out on disk in readdir order
(with each file body being an interval in that ordering). This is done
by arranging all dirty blocks of a given transaction according to some
"ideal" ordering and then trying to map this ordering onto disk blocks.

As you see, in this case allocation is not done on an inode-by-inode basis
at all: instead, delayed allocation is done at the transaction level of
granularity, and I am trying to point out that this is the natural thing for
a journalled file system to do.
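
As a very simplified illustration of transaction-level allocation, the
user-space C sketch below sorts all dirty blocks of one transaction by
(readdir position of the owning file, offset within the file) and then
assigns consecutive block numbers. This is not reiser4 code: the
toy_dirty type, the comparison key, and the assumption that a
contiguous free region starting at first_free is available are all made
up for the example.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_dirty {
    unsigned int dir_pos; /* position of the owning file in readdir order */
    unsigned long offset; /* block offset within that file */
    long blocknr;         /* assigned at allocation time, -1 before */
};

/* "Ideal" ordering: files in readdir order, each file body contiguous. */
static int toy_ideal_cmp(const void *pa, const void *pb)
{
    const struct toy_dirty *a = pa, *b = pb;

    if (a->dir_pos != b->dir_pos)
        return a->dir_pos < b->dir_pos ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Allocate for the whole transaction at once, across all its files. */
static void toy_allocate_transaction(struct toy_dirty *blocks, size_t n,
                                     long first_free)
{
    assert(blocks != NULL || n == 0);
    qsort(blocks, n, sizeof(*blocks), toy_ideal_cmp);
    for (size_t i = 0; i < n; i++)
        blocks[i].blocknr = first_free + (long)i;
}
```

Nothing in this scheme is per-inode: the unit handed to the allocator
is the set of dirty blocks of the transaction.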

The same goes for write-out: in ext3 there is only one "active"
transaction at any moment, which means that ->writepages() calls can
go in arbitrary order. But for a file system with multiple active
transactions that can be committed separately, the order of ->writepages()
calls has to follow the ordering between transactions. Again, this means
that write-out should be transaction-based rather than inode-based.

If we want really generic support for journalling and
delayed allocation, the mpage_* functions are the wrong level. Instead, a
proper notion of transaction has to be introduced, and the file system IO
and disk space allocation interfaces adjusted appropriately.

Nikita.


* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 11:22   ` Nikita Danilov
  2005-04-19 14:46     ` Badari Pulavarty
@ 2005-04-20  0:00     ` Bryan Henderson
  1 sibling, 0 replies; 17+ messages in thread
From: Bryan Henderson @ 2005-04-20  0:00 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: linux-fsdevel, pbadari

>> routines will fail - since they assume that page->private represents
>> bufferheads. So we need a better way to do this.
>
>They are not generic then. Some file systems store things completely
>different from buffer head ring in page->private.

I've seen these instances (and worked around them, because I maintain
filesystem code that does in fact use private pages but does not use the buffer
cache to manage them). I've always assumed they're just errors: corners
that were cut in the original project to abstract out the buffer cache.
Anyone who has a problem with them should just fix them.

>I think that one reasonable way to add generic support for journalling
>is to split struct address_space into two objects: lower layer that
>represents "file" (say, struct vm_file), in which pages are linearly
>ordered, and on top of this vm_cache (representing transaction) that
>keeps track of pages from various vm_file's. vm_file is embedded into
>inode, and vm_cache has a pointer to (the analog of) struct
>address_space_operations.
>
>vm_cache's are created by file system back-end as necessary (can be
>embedded into inode for non-journalled file systems). VM scanner and
>balance_dirty_pages() call vm_cache operations to do write-out.

That looks entirely reasonable to me, but should be combined with 
divorcing address spaces from files.  An address space (or the "lower 
level" above) should be a simple virtual memory object, managed by the 
virtual memory manager.  It can be used for a file data cache, but also 
for anything else you want to participate in system memory management / 
page replacement.

We're already practically there.  Address spaces are tied to files only in 
these ways:

  1) The code is in the fs/ directory.  It needs to be in mm/ .

  2) The "host" field is a struct inode *.  It needs to be void *.

  3) In a handful of places (and they keep moving), memory manager 
     code dereferences 'host' and looks in the inode.  I know these 
     are trivial connections, because I work around them by supplying
     a dummy inode (and sometimes a dummy superblock) with a few 
     fields filled in.

(Incidentally, _I_ am actually using address spaces for file caches; I 
just can't tie them to the files in the traditional way; the cache exists 
even when there are no inodes for the file).

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems


* Re: Lazy block allocation and block_prepare_write?
  2005-04-19  3:01 ` Badari Pulavarty
  2005-04-19 10:10   ` Alex Tomas
  2005-04-19 11:22   ` Nikita Danilov
@ 2005-04-19 20:41   ` Martin Jambor
  2005-04-20 14:52     ` Badari Pulavarty
  2 siblings, 1 reply; 17+ messages in thread
From: Martin Jambor @ 2005-04-19 20:41 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-fsdevel

Hi,

I see the discussion here quickly went beyond the scope
of my first posting :-) Anyway, the filesystem we're implementing is a
variant of a classic log-structured filesystem which is quite similar
to Unix filesystems in many respects (inodes and so on), and we
will have 0-1 (sort of) transactions, so as far as this issue is
concerned our case is probably very similar to ext3 delayed
allocation.

On 4/19/05, Badari Pulavarty <pbadari@us.ibm.com> wrote:
> The idea is to "reserve" a block at the prepare/commit write instead
> of allocating the block. Do the actual allocation in writepage().

Exactly.

> Here are the issues:
> ====================
> 
> 1) Currently none of the generic helper routines can handle this.
> We need to add support to do these, but still somehow make the
> routines generic enough for every ones use.

I'm quite happy about most of them. I can't see how we could use any
generic form of writepage(s) as we write stuff in a quite different
way from almost anybody else but all the others except
block_prepare_write do pretty much exactly what we need (if I have
not missed something).

> 2) There is no easy way to find out if we "reserved" a block or
> not in writepage() correctly. There are 2 paths to writepage().
> 
>         sys_write() -> prepare/commit()
>                 and later sync() ----> writepage()
> 
>         mmap() -> touch a page()
>                 and later --> writepage()
> 
> In order to do the correct accounting, we need to mark a page
> to indicate if we reserved a block or not. One way to do this,
> to use page->private to indicate this. But then, all the generic
> routines will fail - since they assume that page->private represents
> bufferheads. So we need a better way to do this.

I didn't hope for a special bit in struct page so I wanted to simply
fake the page/buffer mapping somehow. Since we don't really care
whether a page is mapped or reserved as long as it is at least one of
these when actually writing it (we write stuff to different places
from where we read it), the PG_mappedtodisk flag is fine for us
as long as no other kernel code thinks that having it set means we
also have buffers which point to meaningful positions on the device
because we don't. Is that the case?

Of course, having a PG_RESERVED flag would be a nice and clean thing
to use and we would be more than happy to do so.

> 3) We need add hooks into filesystem specific calls from these
> generic routines to handle "journaling mode" requirements

Our fs is basically one big journal so we don't need any of these. Or
at least I don't see any need for it at the moment.

> So, what are your requirements ?  I am looking for a common
> way to combine all the requirements and come out with a
> saner "generic" routines to handle these.

I'm happy with most generic functions. We need to implement
writepage(s) ourselves no matter what, the only problem is
block_prepare_write and I can currently only see two options for us:

1) Implement it ourselves and use a flag in the struct page to mark it reserved.

2) Use block_prepare_write but enable the get_block function to mark
an individual buffer as reserved so that it is treated as mapped (can
be dirty and stuff) but no code assumes it is located somewhere on the
disk (for example block_prepare_write would not call
unmap_underlying_metadata).

I think we'll go for the first method, but the second would make life
easier for filesystems which can have pages consisting of both mapped
and reserved blocks.
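
To illustrate what I mean by the second option, here is a small
user-space model in C. It is not kernel code: struct buf, the BUF_* flags,
BLOCK_NONE and all the helper names are hypothetical stand-ins for
buffer_head state, invented for this sketch. The only point is that a
reserved buffer passes the "may be dirtied" check while being excluded
from anything that needs a real block number.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_MAPPED   (1u << 0)    /* has a real on-disk block number */
#define BUF_RESERVED (1u << 1)    /* space accounted for, no block yet */
#define BUF_DIRTY    (1u << 2)

#define BLOCK_NONE UINT64_MAX     /* placeholder, never a valid block */

struct buf {
    unsigned flags;
    uint64_t blocknr;
};

/* What a reserving get_block-style callback would do to the buffer. */
static void buf_reserve(struct buf *b)
{
    b->flags |= BUF_RESERVED;
    b->flags &= ~BUF_MAPPED;
    b->blocknr = BLOCK_NONE;
}

/* Both mapped and reserved buffers may carry dirty data. */
static int buf_can_dirty(const struct buf *b)
{
    return (b->flags & (BUF_MAPPED | BUF_RESERVED)) != 0;
}

/* Returns 0 on success, -1 if the buffer is neither mapped nor reserved. */
static int buf_mark_dirty(struct buf *b)
{
    if (!buf_can_dirty(b))
        return -1;
    b->flags |= BUF_DIRTY;
    return 0;
}

/* Only mapped buffers have a disk location. Code that looks for aliases
 * of a block number (the job unmap_underlying_metadata does in the
 * kernel) would have to check this first and skip reserved buffers. */
static int buf_has_disk_location(const struct buf *b)
{
    return (b->flags & BUF_MAPPED) != 0;
}

/* At write-out time the reservation becomes a real mapping. */
static void buf_map(struct buf *b, uint64_t blocknr)
{
    assert(blocknr != BLOCK_NONE);
    b->flags &= ~BUF_RESERVED;
    b->flags |= BUF_MAPPED;
    b->blocknr = blocknr;
}
```

With per-buffer state like this, one page can hold a mix of mapped and
reserved blocks, which a single per-page flag cannot express.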

Thank you very much for your reply, the whole thread has been well
worth reading.

Martin Jambor


* Re: Lazy block allocation and block_prepare_write?
  2005-04-19 20:41   ` Martin Jambor
@ 2005-04-20 14:52     ` Badari Pulavarty
  0 siblings, 0 replies; 17+ messages in thread
From: Badari Pulavarty @ 2005-04-20 14:52 UTC (permalink / raw)
  To: Martin Jambor, Nikita Danilov, Bryan Henderson; +Cc: linux-fsdevel

On Tue, 2005-04-19 at 13:41, Martin Jambor wrote:

> > 
> > 1) Currently none of the generic helper routines can handle this.
> > We need to add support to do these, but still somehow make the
> > routines generic enough for every ones use.
> 
> I'm quite happy about most of them. I can't see how we could use any
> generic form of writepage(s) as we write stuff in a quite different
> way from almost anybody else but all the others except
> block_prepare_write do  pretty much exactly what we need (if I have
> not missed something).

Yep. You need to have your own block_prepare_write() which doesn't
allocate a block - instead it "reserves" a block.
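
Roughly, the accounting could look like the user-space model below. All
names in it (struct lfs, struct lfs_page, PAGE_RESERVED, lfs_reserve,
lfs_prepare_write, lfs_writepage) are invented for this sketch, and the
per-page flag stands in for whatever marker the filesystem ends up using.
It covers both paths into writepage(): a page dirtied through mmap() never
went through prepare_write, so writepage has to reserve late and may still
fail with ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PAGE_RESERVED (1u << 0)   /* hypothetical per-page marker */

struct lfs {
    uint64_t free_blocks;   /* blocks neither allocated nor reserved */
    uint64_t reserved;      /* promised to dirty pages, not yet placed */
    uint64_t log_head;      /* next block to be written in the log */
};

struct lfs_page {
    unsigned flags;
    uint64_t blocknr;       /* meaningful only after lfs_writepage succeeds */
};

/* Take one block out of the free pool without deciding where it lives. */
static int lfs_reserve(struct lfs *fs, struct lfs_page *pg)
{
    if (pg->flags & PAGE_RESERVED)
        return 0;           /* already accounted for */
    if (fs->free_blocks == 0)
        return -ENOSPC;
    fs->free_blocks--;
    fs->reserved++;
    pg->flags |= PAGE_RESERVED;
    return 0;
}

/* sys_write() path: report a full disk early, but pick no block number. */
static int lfs_prepare_write(struct lfs *fs, struct lfs_page *pg)
{
    return lfs_reserve(fs, pg);
}

/* Write-out, reached from both the sys_write() and the mmap() path. */
static int lfs_writepage(struct lfs *fs, struct lfs_page *pg)
{
    int err = lfs_reserve(fs, pg);  /* no-op if prepare_write already ran */

    if (err)
        return err;
    assert(fs->reserved > 0);
    fs->reserved--;
    pg->flags &= ~PAGE_RESERVED;
    pg->blocknr = fs->log_head++;   /* the actual allocation happens here */
    return 0;
}
```

Freeing the block a rewritten page previously occupied is left out of the
sketch; in a log-structured filesystem that would be the cleaner's job.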

> 
> > 2) There is no easy way to find out if we "reserved" a block or
> > not in writepage() correctly. There are 2 paths to writepage().
> > 
> >         sys_write() -> prepare/commit()
> >                 and later sync() ----> writepage()
> > 
> >         mmap() -> touch a page()
> >                 and later --> writepage()
> > 
> > In order to do the correct accounting, we need to mark a page
> > to indicate if we reserved a block or not. One way to do this,
> > to use page->private to indicate this. But then, all the generic
> > routines will fail - since they assume that page->private represents
> > bufferheads. So we need a better way to do this.
> 
> I didn't hope for a special bit in struct page so I wanted to simply
> fake the page/buffer mapping somehow. Since we don't really care
> whether a page is mapped or reserved as long as it is at least one of
> these when actually writing it (we write stuff to different places
> from where we read it), the PG_mappedtodisk flag is fine for us
> as long as no other kernel code thinks that having it set means we
> also have buffers which point to meaningful positions on the device
> because we don't. Is that the case?
> 
> Of course, having a PG_RESERVED flag would be a nice and clean thing
> to use and we would be more than happy to do so.
> 
> > 3) We need add hooks into filesystem specific calls from these
> > generic routines to handle "journaling mode" requirements
> 
> Our fs is basically one big journal so we don't need any of these. Or
> at least I don't see any need for it at the moment.
> 
> > So, what are your requirements ?  I am looking for a common
> > way to combine all the requirements and come out with a
> > saner "generic" routines to handle these.
> 
> I'm happy with most generic functions. We need to implement
> writepage(s) ourselves no matter what, the only problem is
> block_prepare_write and I can currently only see two options for us:
> 
> 1) Implement it ourselves and use a flag in the struct page to mark it reserved.
> 
> 2) Use block_prepare_write but enable the get_block function to mark
> an individual buffer as reserved so that it is treated as mapped (can
> be dirty and stuff) but no code assumes it is located somewhere on the
> disk (for example block_prepare_write would not call
> unmap_underlying_metadata).
> 
> I think we'll go for the first method, but the second would make life
> easier for filesystems which can have pages consisting of both mapped
> and reserved blocks.

I guess for now you can do all this in your filesystem-specific code.
Too bad everyone has to write their own. Hopefully we will clean up all
these interfaces and provide generic *enough* ones to do this,
so everyone can use them.

As you can see, lots of issues need to be worked out, and I don't have
the bandwidth to undertake such an effort unless Nikita and Bryan
are willing to sign up for this :)

Thanks,
Badari




Thread overview: 17+ messages
2005-04-18  0:54 Lazy block allocation and block_prepare_write? Martin Jambor
2005-04-19  3:01 ` Badari Pulavarty
2005-04-19 10:10   ` Alex Tomas
2005-04-19 14:48     ` Badari Pulavarty
2005-04-19 15:04       ` Alex Tomas
2005-04-19 15:00         ` Badari Pulavarty
2005-04-19 15:20           ` Alex Tomas
2005-04-19 11:22   ` Nikita Danilov
2005-04-19 14:46     ` Badari Pulavarty
2005-04-19 15:55       ` Nikita Danilov
2005-04-19 16:06         ` Alex Tomas
2005-04-19 16:59           ` Badari Pulavarty
2005-04-19 17:08         ` Mingming Cao
2005-04-19 18:45           ` Nikita Danilov
2005-04-20  0:00     ` Bryan Henderson
2005-04-19 20:41   ` Martin Jambor
2005-04-20 14:52     ` Badari Pulavarty
