* [RFC] generic IO write clustering
@ 2001-01-20 0:34 Marcelo Tosatti
2001-01-20 2:58 ` Rik van Riel
2001-01-20 15:57 ` Christoph Hellwig
0 siblings, 2 replies; 9+ messages in thread
From: Marcelo Tosatti @ 2001-01-20 0:34 UTC (permalink / raw)
To: linux-kernel; +Cc: Rajagopal Ananthanarayanan, Rik van Riel, Stephen C. Tweedie
Hi,
I'm starting to implement a generic write clustering scheme and I would
like to receive comments and suggestions.
The write clustering issue has already been discussed (mainly at Miami)
and the agreement, AFAIK, was to implement the write clustering at the
per-address-space writepage() operation.
IMO there are some problems if we implement the write clustering at this
level:
- The filesystem does not have (and should not have) the information
needed to limit the cluster size under memory shortage.
- By doing the write clustering at a higher level, we avoid a ton of
filesystems duplicating the code.
So what I suggest is to add a "cluster" operation to struct address_space
which can be used by the VM code to know the optimal IO transfer unit in
the storage device. Something like this (maybe we need an async flag but
that's a minor detail for now):
int (*cluster)(struct page *, unsigned long *boffset,
unsigned long *poffset);
"page" is from where the filesystem code should start its search for
contiguous pages. boffset and poffset are passed by the VM code to know
the logical "backwards offset" (number of contiguous pages going backwards
from "page") and "forward offset" (contiguous pages going forward from
"page") in the inode.
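Purely as illustration of that contract (everything below is a userspace toy, not kernel API): a window search over a dirty[] array standing in for the page cache, returning the backwards and forward counts the way the suggested ->cluster() would:

```c
#include <stddef.h>

/* Toy model: dirty[i] says whether logical page i of the inode is dirty.
 * cluster() scans backwards and forwards from `index` and reports how
 * many contiguous dirty pages sit on each side of it -- the contract of
 * the proposed a_ops->cluster(), minus all filesystem details. */
static int cluster(const int *dirty, size_t npages, size_t index,
                   unsigned long *boffset, unsigned long *poffset)
{
    unsigned long b = 0, p = 0;

    if (index >= npages || !dirty[index])
        return -1; /* the starting page must be a dirty page */

    while (index - b > 0 && dirty[index - b - 1])
        b++;
    while (index + p + 1 < npages && dirty[index + p + 1])
        p++;

    *boffset = b;
    *poffset = p;
    return (int)(b + p + 1); /* total clusterable pages */
}
```

With pages 1-4 dirty and the search started at page 2, this reports one page behind and two ahead.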
The idea is to work with delayed-allocated pages, too. A filesystem which
has this feature can, in its "cluster" operation, allocate the delayed
pages contiguously on disk, and then return to the VM code, which can
then potentially write a bunch of dirty pages in a few big IO operations.
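A toy sketch of that delayed-allocation step (blocknr[], UNALLOCATED and next_free are all invented for illustration): inside its cluster operation the filesystem could assign on-disk blocks to any still-unallocated dirty pages in the window, so the subsequent writeout hits one contiguous run:

```c
#include <stddef.h>

#define UNALLOCATED (-1L)

/* Toy model: blocknr[i] is page i's on-disk block number, or UNALLOCATED
 * for a delayed-allocated page. Assign blocks to the dirty pages in
 * [start, start + n) contiguously, starting at *next_free -- roughly what
 * a delalloc filesystem could do inside its "cluster" operation. */
static void cluster_delalloc(long *blocknr, const int *dirty,
                             size_t start, size_t n, long *next_free)
{
    for (size_t i = start; i < start + n; i++)
        if (dirty[i] && blocknr[i] == UNALLOCATED)
            blocknr[i] = (*next_free)++;
}
```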
I'm sure a bit of tuning will be needed to find the optimal cluster
size, and some fs locking problems will appear.
But it seems worthwhile to me, since we avoid a lot of future code
replication and the performance gain will be _nice_.
Comments, thoughts?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/
* Re: [RFC] generic IO write clustering
From: Marcelo Tosatti @ 2001-01-20 1:52 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, Rajagopal Ananthanarayanan, Stephen C. Tweedie
On Sat, 20 Jan 2001, Rik van Riel wrote:
> Is there ever a reason NOT to do the best possible IO
> clustering at write time ?
>
> Remember that disk writes do not cost memory and have
> no influence on the resident set ... completely unlike
> read clustering, which does need to be limited.
You don't want to have too many ongoing writes at the same time, to
avoid completely starving the system. We already limit this, and have
to, in quite a few places.
> > - By doing the write clustering at a higher level, we avoid a ton of
> > filesystems duplicating the code.
> >
> > So what I suggest is to add a "cluster" operation to struct address_space
> > which can be used by the VM code to know the optimal IO transfer unit in
> > the storage device. Something like this (maybe we need an async flag but
> > that's a minor detail for now):
> >
> > int (*cluster)(struct page *, unsigned long *boffset,
> > unsigned long *poffset);
>
> Makes sense, except that I don't see how (or why) the _VM_
> should "know the optimal IO transfer unit". This sounds more
> like a job for the IO subsystem and/or the filesystem, IMHO.
The a_ops->cluster() operation will make the VM aware of the contiguous
pages which can be clustered.
The VM does not know about _any_ fs lowlevel details (which are hidden
behind ->cluster()), including buffer_head's.
>
> > "page" is from where the filesystem code should start its search
> > for contiguous pages. boffset and poffset are passed by the VM
> > code to know the logical "backwards offset" (number of
> > contiguous pages going backwards from "page") and "forward
> > offset" (contiguous pages going forward from "page") in the inode.
>
> Yes, this makes a LOT of sense. I really like a pagecache
> helper function so the filesystems can build their writeout
> clusters easier.
The address-space owners (filesystems _and_ swap, in this case) do not
need to implement writeout clustering at all, because we do it in the
VM _without_ having to know about low-level details.
Take a look at this rough pseudo-code:
int cluster_write(struct page *page)
{
	struct address_space *mapping = page->mapping;
	unsigned long boffset, poffset;
	int nr_pages;
	...
	/* How many pages can we write for free? */
	nr_pages = mapping->a_ops->cluster(page, &boffset, &poffset);
	...
	return page_cluster_flush(page, nr_pages);
}
/*
 * @page: dirty page from where to start the search
 * @csize: maximum size of the cluster
 */
int page_cluster_flush(struct page *page, int csize)
{
	struct page *cpages[csize];
	struct address_space *mapping = page->mapping;
	struct inode *inode = mapping->host;
	unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
	unsigned long index = page->index;
	unsigned long curr_index = page->index;
	int count, i;

	cpages[0] = page;
	count = 1;
	/* Search for clusterable dirty pages behind */
	...
	/* Search for clusterable dirty pages ahead */
	...
	/* Write all of them */
	for (i = 0; i < count; i++) {
		ClearPageDirty(cpages[i]);
		mapping->a_ops->writepage(cpages[i]);
		...
	}
}
This way we have _one_ clean implementation of write clustering without
any low-level crap involved. Try to imagine the amount of code people will
manage to write in their own fs's to implement write clustering.
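To make the shape concrete, here is a runnable userspace mock of that flush path; the dirty[] array, mock_writepage() and the fixed-size written[] log are all invented scaffolding, not kernel code:

```c
#include <stddef.h>

/* Mock writepage: record which logical pages were written, in order. */
static size_t written[16], nwritten;

static void mock_writepage(size_t index)
{
    written[nwritten++] = index;
}

/* Collect up to csize contiguous dirty pages around `index`, clear their
 * dirty bits and write them out -- the shape of page_cluster_flush()
 * above, minus locking and real IO. Returns the cluster size. */
static int page_cluster_flush(int *dirty, size_t npages, size_t index,
                              int csize)
{
    size_t lo = index, hi = index;
    int count = 1;

    /* Search for clusterable dirty pages behind */
    while (lo > 0 && dirty[lo - 1] && count < csize) {
        lo--;
        count++;
    }
    /* Search for clusterable dirty pages ahead */
    while (hi + 1 < npages && dirty[hi + 1] && count < csize) {
        hi++;
        count++;
    }
    /* Write all of them */
    for (size_t i = lo; i <= hi; i++) {
        dirty[i] = 0;
        mock_writepage(i);
    }
    return count;
}
```

Starting from a dirty page, the two scans only ever grow the window over adjacent dirty pages, so the writeout is always one contiguous logical run.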
* Re: [RFC] generic IO write clustering
From: Rik van Riel @ 2001-01-20 2:58 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: linux-kernel, Rajagopal Ananthanarayanan, Stephen C. Tweedie
On Fri, 19 Jan 2001, Marcelo Tosatti wrote:
> The write clustering issue has already been discussed (mainly at Miami)
> and the agreement, AFAIK, was to implement the write clustering at the
> per-address-space writepage() operation.
>
> IMO there are some problems if we implement the write clustering at this
> level:
>
> - The filesystem does not have (and should not have) the information
> needed to limit the cluster size under memory shortage.
Is there ever a reason NOT to do the best possible IO
clustering at write time ?
Remember that disk writes do not cost memory and have
no influence on the resident set ... completely unlike
read clustering, which does need to be limited.
> - By doing the write clustering at a higher level, we avoid a ton of
> filesystems duplicating the code.
>
> So what I suggest is to add a "cluster" operation to struct address_space
> which can be used by the VM code to know the optimal IO transfer unit in
> the storage device. Something like this (maybe we need an async flag but
> that's a minor detail for now):
>
> int (*cluster)(struct page *, unsigned long *boffset,
> unsigned long *poffset);
Makes sense, except that I don't see how (or why) the _VM_
should "know the optimal IO transfer unit". This sounds more
like a job for the IO subsystem and/or the filesystem, IMHO.
> "page" is from where the filesystem code should start its search
> for contiguous pages. boffset and poffset are passed by the VM
> code to know the logical "backwards offset" (number of
> contiguous pages going backwards from "page") and "forward
> offset" (contiguous pages going forward from "page") in the inode.
Yes, this makes a LOT of sense. I really like a pagecache
helper function so the filesystems can build their writeout
clusters easier.
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
* Re: [RFC] generic IO write clustering
From: Marcelo Tosatti @ 2001-01-20 15:24 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Rajagopal Ananthanarayanan, Rik van Riel, Stephen C. Tweedie,
linux-kernel
On Sat, 20 Jan 2001, Christoph Hellwig wrote:
<snip>
> I think there is a big disadvantage of this approach:
> To find out which pages are clusterable, we need to do bmap/get_block,
> which means we have to go through the block-allocation functions, which
> is rather expensive, and then we have to do it again in writepage, for
> the pages that are actually clustered by the VM.
If the metadata was already cached before ->cluster(), there is no disk
IO at all; if not, ->cluster() will cache it, avoiding further disk
accesses by writepage() (or writepages()).
> Another thing I dislike is that the flushing gets more complicated with
> your VM-level clustering. Now (and with the approach I'll describe
> below) flushing is 'write it out now and do whatever else you want';
> with your design it is 'find the pages beside this page and write out
> a bunch of them' - much more complicated. I'd like it abstracted out.
I don't see your point here. What am I missing?
> > The idea is to work with delayed-allocated pages, too. A filesystem which
> > has this feature can, in its "cluster" operation, allocate the delayed
> > pages contiguously on disk, and then return to the VM code, which can
> > then potentially write a bunch of dirty pages in a few big IO operations.
>
> That also works nicely together with ->writepage-level IO clustering.
>
> > I'm sure a bit of tuning will be needed to find the optimal cluster
> > size, and some fs locking problems will appear.
>
> Sure, but again that's an issue for every kind of IO clustering...
>
>
> Now, my proposal. I prefer doing it in writepage, as stated above.
> Writepage loops over the MAX_CLUSTERED_PAGES/2 dirty pages before and
> behind the initial page; it first tests whether the page should be
> clustered (a callback from the VM, highly 'balanceable'...), then does
> a bmap/get_block to check whether it is contiguous.
>
> Finally the IO is submitted using a submit_bh loop, or when using a
> kiobuf-based IO path all clustered pages are passed down to ll_rw_kio
> in one piece.
> As you can see, easy integration with the new bulk-IO mechanisms is also
> an advantage of this proposal, without the need for a new multi-page a_op.
IMHO replicating the code is the worst thing.
* Re: [RFC] generic IO write clustering
From: Christoph Hellwig @ 2001-01-20 15:57 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Rajagopal Ananthanarayanan, Rik van Riel, Stephen C. Tweedie,
linux-kernel
In article <Pine.LNX.4.21.0101192142060.6167-100000@freak.distro.conectiva> you wrote:
> The write clustering issue has already been discussed (mainly at Miami)
> and the agreement, AFAIK, was to implement the write clustering at the
> per-address-space writepage() operation.
> IMO there are some problems if we implement the write clustering at this
> level:
> - The filesystem does not have (and should not have) the information
> needed to limit the cluster size under memory shortage.
Agreed.
> - By doing the write clustering at a higher level, we avoid a ton of
> filesystems duplicating the code.
Most filesystems share their writepage implementation, and most of the
others have special requirements for write clustering anyway.
For example, extent-based filesystems (xfs, jfs) usually want to write out
more pages even if the VM doesn't see a need, just for efficiency reasons.
Network filesystems also need special care with write clustering,
because the network behaves differently from a typical disk...
> So what I suggest is to add a "cluster" operation to struct address_space
> which can be used by the VM code to know the optimal IO transfer unit in
> the storage device. Something like this (maybe we need an async flag but
> that's a minor detail for now):
> int (*cluster)(struct page *, unsigned long *boffset,
> unsigned long *poffset);
> "page" is from where the filesystem code should start its search for
> contiguous pages. boffset and poffset are passed by the VM code to know
> the logical "backwards offset" (number of contiguous pages going backwards
> from "page") and "forward offset" (contiguous pages going forward from
> "page") in the inode.
I think there is a big disadvantage of this approach:
To find out which pages are clusterable, we need to do bmap/get_block,
which means we have to go through the block-allocation functions, which
is rather expensive, and then we have to do it again in writepage, for
the pages that are actually clustered by the VM.
Another thing I dislike is that the flushing gets more complicated with
your VM-level clustering. Now (and with the approach I'll describe
below) flushing is 'write it out now and do whatever else you want';
with your design it is 'find the pages beside this page and write out
a bunch of them' - much more complicated. I'd like it abstracted out.
> The idea is to work with delayed-allocated pages, too. A filesystem which
> has this feature can, in its "cluster" operation, allocate the delayed
> pages contiguously on disk, and then return to the VM code, which can
> then potentially write a bunch of dirty pages in a few big IO operations.
That also works nicely together with ->writepage-level IO clustering.
> I'm sure a bit of tuning will be needed to find the optimal cluster
> size, and some fs locking problems will appear.
Sure, but again that's an issue for every kind of IO clustering...
Now, my proposal. I prefer doing it in writepage, as stated above.
Writepage loops over the MAX_CLUSTERED_PAGES/2 dirty pages before and
behind the initial page; it first tests whether the page should be
clustered (a callback from the VM, highly 'balanceable'...), then does
a bmap/get_block to check whether it is contiguous.
Finally the IO is submitted using a submit_bh loop, or when using a
kiobuf-based IO path all clustered pages are passed down to ll_rw_kio
in one piece.
As you can see, easy integration with the new bulk-IO mechanisms is also
an advantage of this proposal, without the need for a new multi-page a_op.
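For contrast, that writepage-level scan could be sketched in userspace like this; blocknr[] stands in for what bmap/get_block would return, and MAX_CLUSTERED_PAGES, writepage_cluster() and the rest are illustrative names, not actual kernel code:

```c
#include <stddef.h>

#define MAX_CLUSTERED_PAGES 8

/* Toy model: blocknr[i] is the physical block of logical page i (as
 * bmap/get_block would report), dirty[i] its dirty bit. Starting from
 * `start`, scan up to MAX_CLUSTERED_PAGES/2 pages on each side and keep
 * only those that are dirty AND physically contiguous on disk. Returns
 * the cluster size and its page range via *first/*last. */
static int writepage_cluster(const long *blocknr, const int *dirty,
                             size_t npages, size_t start,
                             size_t *first, size_t *last)
{
    size_t lo = start, hi = start;

    while (lo > 0 && start - lo < MAX_CLUSTERED_PAGES / 2 &&
           dirty[lo - 1] && blocknr[lo - 1] == blocknr[lo] - 1)
        lo--;
    while (hi + 1 < npages && hi - start < MAX_CLUSTERED_PAGES / 2 &&
           dirty[hi + 1] && blocknr[hi + 1] == blocknr[hi] + 1)
        hi++;

    *first = lo;
    *last = hi;
    return (int)(hi - lo + 1);
}
```

A run of blocks 10, 11, 12 followed by 20, 21 would cluster only the first three pages, since the on-disk gap breaks physical contiguity even though all the pages are dirty.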
Christoph
--
Whip me. Beat me. Make me maintain AIX.
* Re: [RFC] generic IO write clustering
From: Marcelo Tosatti @ 2001-01-20 16:00 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Rajagopal Ananthanarayanan, Rik van Riel, Stephen C. Tweedie,
linux-kernel
On Sat, 20 Jan 2001, Christoph Hellwig wrote:
> On Sat, Jan 20, 2001 at 01:24:40PM -0200, Marcelo Tosatti wrote:
> > If the metadata was already cached before ->cluster(), there is no disk
> > IO at all; if not, ->cluster() will cache it, avoiding further disk
> > accesses by writepage() (or writepages()).
>
> True. But you have to go through ext2_get_branch (under the big kernel
> lock) - if we can do only one logical->physical block translation,
> why do it multiple times?
You don't. If the metadata is cached and uptodate there is no need to call
get_block().
* Re: [RFC] generic IO write clustering
From: Christoph Hellwig @ 2001-01-20 17:45 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Rajagopal Ananthanarayanan, Rik van Riel, Stephen C. Tweedie,
linux-kernel
On Sat, Jan 20, 2001 at 01:24:40PM -0200, Marcelo Tosatti wrote:
> If the metadata was already cached before ->cluster(), there is no disk
> IO at all; if not, ->cluster() will cache it, avoiding further disk
> accesses by writepage() (or writepages()).
True. But you have to go through ext2_get_branch (under the big kernel
lock) - if we can do only one logical->physical block translation,
why do it multiple times?
> > Another thing I dislike is that the flushing gets more complicated with
> > your VM-level clustering. Now (and with the approach I'll describe
> > below) flushing is 'write it out now and do whatever else you want';
> > with your design it is 'find the pages beside this page and write out
> > a bunch of them' - much more complicated. I'd like it abstracted out.
>
> I don't see your point here. What am I missing?
It's just a matter of taste.
(I thought it was clear enough that there is no technical advantage...)
> [...]
>
> IMHO replicating the code is the worst thing.
This does not replicate the code. The 'normal' filesystems share the
code, and the special filesystems want their own clustering anyway.
(See the discussion on xfs-devel yesterday).
Christoph
--
Whip me. Beat me. Make me maintain AIX.
* Re: [RFC] generic IO write clustering
From: Marcelo Tosatti @ 2001-01-20 17:55 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Rajagopal Ananthanarayanan, Rik van Riel, Stephen C. Tweedie,
linux-kernel
On Sat, 20 Jan 2001, Christoph Hellwig wrote:
> On Sat, Jan 20, 2001 at 02:00:24PM -0200, Marcelo Tosatti wrote:
> > > True. But you have to go through ext2_get_branch (under the big kernel
> > > lock) - if we can do only one logical->physical block translation,
> > > why do it multiple times?
> >
> > You don't. If the metadata is cached and uptodate there is no need to call
> > get_block().
>
> Oops. You are right for the stock tree - I was only looking at my kio tree,
> where it can't be cached due to the lack of buffer-cache usage...
Must be fixed.
We need a higher-level abstraction which can hold this (and other)
information.
Take a look at SGI's pagebuf page_buf_t.
* Re: [RFC] generic IO write clustering
From: Christoph Hellwig @ 2001-01-20 19:05 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Rajagopal Ananthanarayanan, Rik van Riel, Stephen C. Tweedie,
linux-kernel
On Sat, Jan 20, 2001 at 02:00:24PM -0200, Marcelo Tosatti wrote:
> > True. But you have to go through ext2_get_branch (under the big kernel
> > lock) - if we can do only one logical->physical block translation,
> > why do it multiple times?
>
> You don't. If the metadata is cached and uptodate there is no need to call
> get_block().
Oops. You are right for the stock tree - I was only looking at my kio tree,
where it can't be cached due to the lack of buffer-cache usage...
Christoph
--
Whip me. Beat me. Make me maintain AIX.