Bufferheads & page-cache reference

linux-fsdevel.vger.kernel.org archive mirror

* Bufferheads & page-cache reference
@ 2005-02-14 19:30 Badari Pulavarty
  2005-02-14 19:31 ` [Ext2-devel] " Sonny Rao
  2005-02-14 21:40 ` Andrew Morton
  0 siblings, 2 replies; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-14 19:30 UTC (permalink / raw)
  To: linux-fsdevel, ext2-devel; +Cc: Andrew Morton

Hi,

I am trying to understand the interactions between filesystem pagecache
pages & the bufferheads associated with them.

I was wondering if someone could help me clarify this..

I see that as part of bufferheads to page association, we get a
ref. on the page.

  create_empty_buffers() -> attach_page_buffers() -> page_cache_get()

I also see that this reference gets dropped by:

  shrink_list() -> try_to_release_page() ->
        try_to_free_buffers() -> drop_buffers() ->
                 __clear_page_buffers()-> page_cache_release();

So, it looks like we drop the reference on the page and disassociate
the bufferheads from the page when the VM wants to re-use the page. The
only other path I see where this can happen is through
invalidate_mapping_pages(). Is this true?

If I do fsync(), we flush the data but still leave the page & bufferhead
association in place. So seeing lots of bufferheads even after fsync() is
normal. Correct?
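
A rough userspace sketch of the reference flow described above, for
illustration only. The toy_* names and struct fields are hypothetical
stand-ins, not kernel types; the only behaviour modelled is that attaching
buffers takes a page reference, fsync() cleans the buffers without detaching
them, and the reference is dropped only when the buffers are freed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; the fields are illustrative only. */
struct toy_page {
    int  count;        /* reference count on the page */
    bool has_buffers;  /* buffers are attached to the page */
    int  nr_dirty_bh;  /* attached buffers that are still dirty */
};

/* Models attach_page_buffers(): attaching buffers pins the page. */
void toy_attach_buffers(struct toy_page *page)
{
    assert(!page->has_buffers);
    page->has_buffers = true;
    page->count++;              /* stands in for page_cache_get() */
}

/* Models fsync(): the data is written, the buffers stay attached. */
void toy_fsync(struct toy_page *page)
{
    page->nr_dirty_bh = 0;
}

/* Models try_to_free_buffers(): fails while any buffer is dirty. */
bool toy_try_to_free_buffers(struct toy_page *page)
{
    if (!page->has_buffers || page->nr_dirty_bh > 0)
        return false;
    page->has_buffers = false;
    page->count--;              /* stands in for page_cache_release() */
    return true;
}
```

In this model the extra reference survives toy_fsync() and goes away only
once toy_try_to_free_buffers() succeeds, which matches the question being
asked.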

Thanks,
Badari

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Ext2-devel] Bufferheads & page-cache reference
  2005-02-14 19:30 Bufferheads & page-cache reference Badari Pulavarty
@ 2005-02-14 19:31 ` Sonny Rao
  2005-02-14 21:40 ` Andrew Morton
  1 sibling, 0 replies; 27+ messages in thread
From: Sonny Rao @ 2005-02-14 19:31 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-fsdevel, ext2-devel, Andrew Morton

On Mon, Feb 14, 2005 at 11:30:16AM -0800, Badari Pulavarty wrote:
> Hi,,
> 
> I am trying to understand interactions between filesystem pagecache
> pages & bufferhead associated with them. 
> 
> I was wondering if someone could help me clarify this..
> 
> I see that as part of bufferheads to page association, we get a
> ref. on the page.
> 
>   create_empty_buffers() -> attach_page_buffers() -> page_cache_get()
> 
> I also see that this reference get dropped by ..
> 
>   shrink_list() -> try_to_release_page() ->
>         try_to_free_buffers() -> drop_buffers() ->
>                  __clear_page_buffers()-> page_cache_release();
> 
> So, it looks like we drop the reference on the page and disassociate
> bufferheads from the page when VM wants to re-use the page. Only other
> path, I see this can happen is through invalidate_mapping_pages().
> Is this true ?
> 
> If I do fsync(), we flush the data - still leave the page & bufferhead
> association. If I see lots of bufferheads even after fsync() is normal.
> Correct ?

Also Badari, a minor addition in the same vein: if a_ops->releasepage
gets called, ext3_releasepage will call journal_try_to_free_buffers(),
which may actually release the buffers before try_to_free_buffers()
does.  I believe this should happen in the
shrink_list->try_to_release_page case.

I can't find any other place where the buffers would be released after
a normal ext3 write.

Here's a list of the callers of try_to_free_buffers:


release_buffer_page: 		fs/jbd/commit.c:72
journal_try_to_free_buffers:	fs/jbd/transaction.c:1633
mpage_writepage:		fs/mpage.c:563
try_to_release_page:		fs/buffer.c:1584
grow_dev_page:			fs/buffer.c:1137

XFS only
linvfs_release_page:		fs/xfs/linux-2.6/xfs_aops.c:1250

I looked at all of these and callers to "free_buffer_head" but didn't
see any obvious freeing until we release the whole page under memory
pressure, etc.

Sonny

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-14 19:30 Bufferheads & page-cache reference Badari Pulavarty
  2005-02-14 19:31 ` [Ext2-devel] " Sonny Rao
@ 2005-02-14 21:40 ` Andrew Morton
  2005-02-14 22:10   ` William Lee Irwin III
  2005-02-15  1:27   ` Badari Pulavarty
  1 sibling, 2 replies; 27+ messages in thread
From: Andrew Morton @ 2005-02-14 21:40 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-fsdevel, ext2-devel

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> I see that as part of bufferheads to page association, we get a
>  ref. on the page.
> 
>    create_empty_buffers() -> attach_page_buffers() -> page_cache_get()
> 
>  I also see that this reference get dropped by ..
> 
>    shrink_list() -> try_to_release_page() ->
>          try_to_free_buffers() -> drop_buffers() ->
>                   __clear_page_buffers()-> page_cache_release();
> 
>  So, it looks like we drop the reference on the page and disassociate
>  bufferheads from the page when VM wants to re-use the page. Only other
>  path, I see this can happen is through invalidate_mapping_pages().
>  Is this true ?
> 
>  If I do fsync(), we flush the data - still leave the page & bufferhead
>  association. If I see lots of bufferheads even after fsync() is normal.
>  Correct ?

Seems about right.  There's also the buffer_heads_over_limit logic in
mm/vmscan.c and fs/buffer.c.  That logic has a hole in that it requires
that there be a highmem shortage before we start to reclaim the lowmem
buffer_heads, but it is somewhat helpful.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-14 21:40 ` Andrew Morton
@ 2005-02-14 22:10   ` William Lee Irwin III
  2005-02-14 22:31     ` Andrew Morton
  2005-02-15  1:27   ` Badari Pulavarty
  1 sibling, 1 reply; 27+ messages in thread
From: William Lee Irwin III @ 2005-02-14 22:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Badari Pulavarty, linux-fsdevel, ext2-devel

On Mon, Feb 14, 2005 at 01:40:58PM -0800, Andrew Morton wrote:
> Seems about right.  There's also the buffer_heads_over_limit logic in
> mm/vmscan.c and fs/buffer.c.  That logic has a hole in that it requires
> that there be a highmem shortage before we start to reclaim the lowmem
> buffer_heads, but it is somewhat helpful.

It would be beneficial to close the hole in that logic. Do you have any
particularly preferred methods in mind?


-- wli

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-14 22:10   ` William Lee Irwin III
@ 2005-02-14 22:31     ` Andrew Morton
  2005-02-14 22:50       ` William Lee Irwin III
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2005-02-14 22:31 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: pbadari, linux-fsdevel, ext2-devel

William Lee Irwin III <wli@holomorphy.com> wrote:
>
> On Mon, Feb 14, 2005 at 01:40:58PM -0800, Andrew Morton wrote:
> > Seems about right.  There's also the buffer_heads_over_limit logic in
> > mm/vmscan.c and fs/buffer.c.  That logic has a hole in that it requires
> > that there be a highmem shortage before we start to reclaim the lowmem
> > buffer_heads, but it is somewhat helpful.
> 
> It would be beneficial to close the hole in that logic.

Really?  Who's hurting?

> Do you have any
> particularly preferred methods in mind?
> 

None that are particularly elegant.  One approach might be to trigger a
highmem zone scan when we hit the limit, and to somehow tell that scan to
only do buffer_head stripping.
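
A toy model of the hole and of the approach floated above, as a sketch only.
Everything here (struct toy_zone, toy_reclaim_pass(), the strip_only_scan
flag) is hypothetical and does not correspond to real mm/vmscan.c code. It
assumes a zone is visited only when it is below its watermark, and that
buffer_heads are stripped only from pages in visited zones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a memory zone; the fields are illustrative only. */
struct toy_zone {
    bool highmem;
    long free_pages;
    long pages_low;      /* below this the zone counts as short */
    long pages_with_bh;  /* pages in the zone carrying buffer_heads */
};

/* Returns the number of pages whose buffer_heads were stripped. */
long toy_reclaim_pass(struct toy_zone *zones, size_t nr_zones,
                      bool bh_over_limit, bool strip_only_scan)
{
    long stripped = 0;

    for (size_t i = 0; i < nr_zones; i++) {
        struct toy_zone *z = &zones[i];
        bool zone_short = z->free_pages < z->pages_low;

        /* Modelled current behaviour: only short zones are visited.
         * Hypothetical fix: when over the limit, also visit highmem
         * zones, but only to strip buffer_heads. */
        bool visit = zone_short ||
                     (strip_only_scan && bh_over_limit && z->highmem);

        if (!visit || !bh_over_limit)
            continue;

        assert(z->pages_with_bh >= 0);
        stripped += z->pages_with_bh;
        z->pages_with_bh = 0;
    }
    return stripped;
}
```

With a short lowmem zone and a comfortable highmem zone full of pages with
buffers, the pass strips nothing unless strip_only_scan is set.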

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-14 22:31     ` Andrew Morton
@ 2005-02-14 22:50       ` William Lee Irwin III
  2005-02-15  0:22         ` Badari Pulavarty
  0 siblings, 1 reply; 27+ messages in thread
From: William Lee Irwin III @ 2005-02-14 22:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: pbadari, linux-fsdevel, ext2-devel

On Mon, Feb 14, 2005 at 01:40:58PM -0800, Andrew Morton wrote:
>>> Seems about right.  There's also the buffer_heads_over_limit logic in
>>> mm/vmscan.c and fs/buffer.c.  That logic has a hole in that it requires
>>> that there be a highmem shortage before we start to reclaim the lowmem
>>> buffer_heads, but it is somewhat helpful.

William Lee Irwin III <wli@holomorphy.com> wrote:
>> It would be beneficial to close the hole in that logic.

On Mon, Feb 14, 2005 at 02:31:42PM -0800, Andrew Morton wrote:
> Really?  Who's hurting?

Apart from the fact that buffer_head proliferation has been a perennial
problem, there's little to go on. By and large, 2.6.x production usage
is not yet very significant where I can see it, and "production" usage
tends to be more stressful on account of very long durations and the
variety of unusual situations to which the kernel is subjected.


William Lee Irwin III <wli@holomorphy.com> wrote:
>> Do you have any
>> particularly preferred methods in mind?

On Mon, Feb 14, 2005 at 02:31:42PM -0800, Andrew Morton wrote:
> None that are particularly elegant.  One approach might be to trigger a
> highmem zone scan when we hit the limit, and to somehow tell that scan to
> only do buffer_head stripping.

This is largely what I expected. I'll cook something up depending on
what conclusion is made from the above response.


-- wli

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-14 22:50       ` William Lee Irwin III
@ 2005-02-15  0:22         ` Badari Pulavarty
  2005-02-15  2:57           ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-15  0:22 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Andrew Morton, linux-fsdevel, ext2-devel

On Mon, 2005-02-14 at 14:50, William Lee Irwin III wrote:
> On Mon, Feb 14, 2005 at 01:40:58PM -0800, Andrew Morton wrote:
> >>> Seems about right.  There's also the buffer_heads_over_limit logic in
> >>> mm/vmscan.c and fs/buffer.c.  That logic has a hole in that it requires
> >>> that there be a highmem shortage before we start to reclaim the lowmem
> >>> buffer_heads, but it is somewhat helpful.
> 
> William Lee Irwin III <wli@holomorphy.com> wrote:
> >> It would be beneficial to close the hole in that logic.
> 
> On Mon, Feb 14, 2005 at 02:31:42PM -0800, Andrew Morton wrote:
> > Really?  Who's hurting?
> 
> Apart from the fact that buffer_head proliferation has been a perennial
> problem, there's little to go on. By and large 2.6.x production usage
> is not yet very significant where I can see it, where the "production"
> usage is largely more stressful on account of very long durations and
> the variety of unusual situations to which the kernel is subjected.
> 

Now that we are on the subject of bufferheads and filesystem pagecache,
the reason I am looking through the code closely is this:

Most DB2 customers use a filesystem for their database. Under load,
they complain that the entire memory in the system is used by filesystem
pagecache, free memory is very low, and the system either starts swapping
like crazy or sees lots of memory allocation failures and the OOM killer
kills db2. slabinfo shows lots of bufferheads, and the VM folks claim
that bufferheads are holding a ref. on the pages, so they can't use them.
So I want to find the truth in the story and find out what exactly is
happening here and which one to blame (VM, FS or IO problems).

BTW, all of this is on 2.4 kernels and I don't have a reproducible
testcase :(

Feb 7 05:35:17 nmcopsu41 kernel: ENOMEM in do_get_write_access, retrying.
Feb 7 05:35:18 nmcopsu41 kernel: Out of Memory: Killed process 18517 (db2sysc).
Feb 7 05:35:25 nmcopsu41 kernel: Out of Memory: Killed process 18660 (db2sysc).
Feb 7 05:35:29 nmcopsu41 kernel: Out of Memory: Killed process 18873 (db2sysc).

             total       used       free     shared    buffers     cached
Mem:      16304560   16284152      20408          0     228428   15093736
-/+ buffers/cache:     961988   15342572
Swap:     35655616      24448   35631168
Total:    51960176   16308600   35651576

Thanks,
Badari

-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-14 21:40 ` Andrew Morton
  2005-02-14 22:10   ` William Lee Irwin III
@ 2005-02-15  1:27   ` Badari Pulavarty
  2005-02-15  3:05     ` Andrew Morton
  1 sibling, 1 reply; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-15  1:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, ext2-devel

On Mon, 2005-02-14 at 13:40, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > I see that as part of bufferheads to page association, we get a
> >  ref. on the page.
> > 
> >    create_empty_buffers() -> attach_page_buffers() -> page_cache_get()
> > 
> >  I also see that this reference get dropped by ..
> > 
> >    shrink_list() -> try_to_release_page() ->
> >          try_to_free_buffers() -> drop_buffers() ->
> >                   __clear_page_buffers()-> page_cache_release();
> > 
> >  So, it looks like we drop the reference on the page and disassociate
> >  bufferheads from the page when VM wants to re-use the page. Only other
> >  path, I see this can happen is through invalidate_mapping_pages().
> >  Is this true ?
> > 
> >  If I do fsync(), we flush the data - still leave the page & bufferhead
> >  association. If I see lots of bufferheads even after fsync() is normal.
> >  Correct ?
> 
> Seems about right.  There's also the buffer_heads_over_limit logic in
> mm/vmscan.c and fs/buffer.c.  That logic has a hole in that it requires
> that there be a highmem shortage before we start to reclaim the lowmem
> buffer_heads, but it is somewhat helpful.
> 

Is there anything wrong with tearing down bufferheads after the
writepage/writepages is complete? Maybe as a "-nobh" option for ext3?

Even for ext2 with "-nobh" and JFS, we seem to attach buffer heads
to the page in __block_write_full_page() and leave them around. I was
thinking they get tossed out after the write-out completes. No?

These bufferheads are driving me crazy :)

Thanks,
Badari


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-15  0:22         ` Badari Pulavarty
@ 2005-02-15  2:57           ` Andrew Morton
  2005-02-15 16:03             ` Badari Pulavarty
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2005-02-15  2:57 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: wli, linux-fsdevel, ext2-devel

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
>  Most of DB2 customers use filesystem for their database. Under the load,
>  they complain that entire memory in the system is used by filesystem
>  pagecache, freememory is very low and system starts swapping crazy OR
>  see lots of memory allocation failures and OOM killer kills db2.
>  slabinfo shows lots of bufferheads and VM folks claim that, bufferheads
>  are holding a ref. on the pages, so they can't use them. So, I want
>  to find the truth in the story and findout what exactly happening here
>  and which one to blame (VM or FS or IO problems) ?
> 
>  BTW, all these on 2.4 kernels and I don't have a reproducible testcase
>  :(
> 
>  Feb 7 05:35:17 nmcopsu41 kernel: ENOMEM in do_get_write_access,
>  retrying.

Do these machines have a large amount of highmem?

If so, yes, you can oom because lots of highmem pages have buffer_heads
attached and you've run out of lowmem.  The 2.4 VM will go off looking for
lowmem pages to reclaim and will ignore the highmem pages because there's
no highmem shortage.  Consequently those buffer_heads don't get freed up
and we're unable to reclaim any lowmem -> oom.
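
A back-of-the-envelope sketch of why this can exhaust lowmem, using the
"cached" figure from the free output posted earlier in the thread. The
96-byte buffer_head size and the idea that every cached page carries
buffers are assumptions made for illustration, so the result is an upper
bound; the real object size should be read from /proc/slabinfo.
bh_lowmem_bytes() is a hypothetical helper.

```c
#include <assert.h>

/*
 * Upper bound on the lowmem held by buffer_heads if every cached page
 * has buffers attached. All sizes are supplied by the caller; nothing
 * here is read from a real kernel.
 */
unsigned long long bh_lowmem_bytes(unsigned long long cached_kb,
                                   unsigned int page_kb,
                                   unsigned int block_kb,
                                   unsigned int bh_bytes)
{
    assert(block_kb > 0 && page_kb >= block_kb);

    unsigned long long pages = cached_kb / page_kb;
    unsigned long long bh_per_page = page_kb / block_kb;

    return pages * bh_per_page * bh_bytes;
}
```

With 15093736 kB cached, 4 kB pages and an assumed 96-byte buffer_head,
4 kB blocks give about 362 MB of buffer_heads, while 1 kB blocks give about
1.45 GB, which is more than the roughly 896 MB of lowmem available on a
typical ia32 highmem configuration.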

Andrea did a patch a long time ago (it'll be in suse 2.4 kernels) which,
under these circumstances, strips the buffers from those highmem pages when
they're encountered on the LRU.  From a quick read it seems that that patch
is not in current 2.4 kernels.

It's harder to do that in 2.6 because we have a separate LRU per zone.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-15  1:27   ` Badari Pulavarty
@ 2005-02-15  3:05     ` Andrew Morton
  2005-02-15 16:46       ` Badari Pulavarty
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2005-02-15  3:05 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-fsdevel, ext2-devel

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> Is there anything wrong, if we tear down bufferheads after the
>  writepage/writepages is complete ? may be "-nobh" option for ext3 ?

The I/O completion will happen in interrupt context, which isn't really a
good place to remove those bh's - the buffer_heads would need to be removed
from their journal_heads first.  That's assuming data=ordered.

For data=writeback we could perhaps inspect buffer_heads_over_limit in
end_buffer_async_write(), and if true, try to strip the buffers in
interrupt context.

For data=ordered the best place would be in checkpoint.c somewhere, where
we're detaching buffer_heads from a completed transaction: trylock the
page, strip the journal_heads, try to strip the buffers, unlock page.
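
The data=ordered sequence above (trylock the page, strip the journal_heads,
try to strip the buffers, unlock) can be sketched as a small userspace
model. The toy_* names and fields are hypothetical and only mirror the
order of the steps; none of this is real jbd or buffer-cache code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a pagecache page; the fields are illustrative only. */
struct toy_page {
    bool locked;
    int  nr_journal_heads;  /* buffers still tied to a transaction */
    bool has_buffers;
    bool buffers_busy;      /* dirty, locked or referenced buffers */
};

bool toy_trylock_page(struct toy_page *page)
{
    if (page->locked)
        return false;
    page->locked = true;
    return true;
}

void toy_unlock_page(struct toy_page *page)
{
    assert(page->locked);
    page->locked = false;
}

/* Returns true if the buffers were detached from the page. */
bool toy_strip_after_checkpoint(struct toy_page *page)
{
    bool stripped = false;

    if (!toy_trylock_page(page))
        return false;               /* page is busy elsewhere: give up */

    page->nr_journal_heads = 0;     /* strip the journal_heads */

    if (page->has_buffers && !page->buffers_busy) {
        page->has_buffers = false;  /* try to strip the buffers */
        stripped = true;
    }

    toy_unlock_page(page);
    return stripped;
}
```

Using a trylock rather than a blocking lock means the checkpoint path never
waits on a page; a busy page is simply left for normal reclaim.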

>  Even for ext2 with "-nobh" and JFS - we seem to attach buffer heads
>  to page in __block_write_full_page() and leave them around. I was
>  thinking, they gets tossed out after the write-out completes. No ?

For ext2 nobh we never attach buffer_heads to regular pagecache pages. 
They're only used for metadata.  nobh_prepare_write() doesn't add them and
neither does writepages().

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-15  2:57           ` Andrew Morton
@ 2005-02-15 16:03             ` Badari Pulavarty
  2005-02-15 17:26               ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-15 16:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: wli, linux-fsdevel, ext2-devel

On Mon, 2005-02-14 at 18:57, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> >  Most of DB2 customers use filesystem for their database. Under the load,
> >  they complain that entire memory in the system is used by filesystem
> >  pagecache, freememory is very low and system starts swapping crazy OR
> >  see lots of memory allocation failures and OOM killer kills db2.
> >  slabinfo shows lots of bufferheads and VM folks claim that, bufferheads
> >  are holding a ref. on the pages, so they can't use them. So, I want
> >  to find the truth in the story and findout what exactly happening here
> >  and which one to blame (VM or FS or IO problems) ?
> > 
> >  BTW, all these on 2.4 kernels and I don't have a reproducible testcase
> >  :(
> > 
> >  Feb 7 05:35:17 nmcopsu41 kernel: ENOMEM in do_get_write_access,
> >  retrying.
> 
> Do these machines have a large amount of highmem?
> 
> If so, yes, you can oom because lots of highmem pages have buffer_heads
> attached and you've run out of lowmem.  The 2.4 VM will go off looking for
> lowmem pages to reclaim and will ignore the highmem pages because there's
> no highmem shortage.  Consequently those buffer_heads don't get freed up
> and we're unable to reclaim any lowmem -> oom.
> 
> Andrea did a patch along time ago (it'll be in suse 2.4 kernels) which,
> under these circumstances, strip the buffers from those highmem pages when
> they're encountered on the LRU.  From a quick read it seems that that patch
> is not in current 2.4 kernels.
> 
> It's harder to do that in 2.6 because we have a separate LR per zone.

Our DB2 folks *claim* to have seen this problem on both ia32 and AMD64
customer systems.  So I am not sure if it's really only highmem related.
The only workaround seems to be to configure DB2 not to use more than
1.5GB on an 8GB RAM system :(

I have nothing much to go on, other than looking at data from a sick
machine. What should I be looking at to narrow down the problem
some more?

BTW, none of these BIG customers will take a patch to figure out
what's happening (since it's on their production system) :(


Thanks,
Badari


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-15  3:05     ` Andrew Morton
@ 2005-02-15 16:46       ` Badari Pulavarty
  2005-02-15 17:54         ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-15 16:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, ext2-devel

On Mon, 2005-02-14 at 19:05, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > Is there anything wrong, if we tear down bufferheads after the
> >  writepage/writepages is complete ? may be "-nobh" option for ext3 ?
> 
> The I/O completion will happen in interrupt context, which isn't really a
> good place to remove those bh's - the buffer_heads would need to be removed
> from their journal_heads first.  That's assuming data=ordered.
> 
> For data=writeback we could perhaps inspect buffer_heads_over_limit in
> end_buffer_async_write(), and if true, try to strip the buffers in
> interrupt context.
> 
> For data=ordered the best place would be in checkpoint.c somewhere, where
> we're detaching buffer_heads from a completed transaction: trylock the
> page, strip the journal_heads, try to strip the buffers, unlock page.

Makes sense. But I am not going to do it, till I figure out *if* there
is a real issue with bufferheads & doing this makes sense.

> 
> >  Even for ext2 with "-nobh" and JFS - we seem to attach buffer heads
> >  to page in __block_write_full_page() and leave them around. I was
> >  thinking, they gets tossed out after the write-out completes. No ?
> 
> For ext2 nobh we never attach buffer_heads to regular pagecache pages. 
> They're only used for metadata.  nobh_prepare_write() doesn't add them and
> neither does writepages().

Hmm.. 

Yep. nobh_prepare_write() doesn't add any bufferheads. But
we call block_write_full_page() even for the "nobh" case, which
does create bufferheads, attaches them to the page and operates
on them.

__block_write_full_page()
{
....
....
        if (!page_has_buffers(page)) {
                create_empty_buffers(page, 1 << inode->i_blkbits,
                                 (1 << BH_Dirty)|(1 << BH_Uptodate));
        }

...
}
	
I am missing something really simple here. What is it ?
 
Thanks,
Badari


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-15 16:03             ` Badari Pulavarty
@ 2005-02-15 17:26               ` Andrew Morton
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2005-02-15 17:26 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: wli, linux-fsdevel, ext2-devel

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> On Mon, 2005-02-14 at 18:57, Andrew Morton wrote:
> > Badari Pulavarty <pbadari@us.ibm.com> wrote:
> > >
> > >  Most of DB2 customers use filesystem for their database. Under the load,
> > >  they complain that entire memory in the system is used by filesystem
> > >  pagecache, freememory is very low and system starts swapping crazy OR
> > >  see lots of memory allocation failures and OOM killer kills db2.
> > >  slabinfo shows lots of bufferheads and VM folks claim that, bufferheads
> > >  are holding a ref. on the pages, so they can't use them. So, I want
> > >  to find the truth in the story and findout what exactly happening here
> > >  and which one to blame (VM or FS or IO problems) ?
> > > 
> > >  BTW, all these on 2.4 kernels and I don't have a reproducible testcase
> > >  :(
> > > 
> > >  Feb 7 05:35:17 nmcopsu41 kernel: ENOMEM in do_get_write_access,
> > >  retrying.
> > 
> > Do these machines have a large amount of highmem?
> > 
> > If so, yes, you can oom because lots of highmem pages have buffer_heads
> > attached and you've run out of lowmem.  The 2.4 VM will go off looking for
> > lowmem pages to reclaim and will ignore the highmem pages because there's
> > no highmem shortage.  Consequently those buffer_heads don't get freed up
> > and we're unable to reclaim any lowmem -> oom.
> > 
> > Andrea did a patch along time ago (it'll be in suse 2.4 kernels) which,
> > under these circumstances, strip the buffers from those highmem pages when
> > they're encountered on the LRU.  From a quick read it seems that that patch
> > is not in current 2.4 kernels.
> > 
> > It's harder to do that in 2.6 because we have a separate LR per zone.
> 
> Our DB2 folks *claims* to have seen this problem both on ia32 and AMD64
> customers.  So, I am not sure if its really only highmem related. Only
> workaround seems to be configure DB2 to not to use more than 1.5GB on a
> 8GB RAM system :(

It shouldn't happen on amd64.

> I have nothing much to go on, other than looking data from a sick 
> machine. What should I be looking at, to narrow down the problem
> some more ?

/proc/meminfo and /proc/slabinfo (especially the buffer_head line)
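
A small sketch of pulling the buffer_head numbers out of a /proc/slabinfo
line. It assumes the first three numeric fields after the cache name are
active objects, total objects and object size, which should be checked
against the slabinfo header on the kernel in question.
buffer_head_bytes() is a hypothetical helper, not an existing tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns the bytes held by allocated buffer_head objects described by
 * one slabinfo line, or -1 if the line is not the buffer_head line or
 * cannot be parsed.
 */
long long buffer_head_bytes(const char *line)
{
    char name[64];
    unsigned long long active, total, objsize;

    assert(line != NULL);

    if (sscanf(line, "%63s %llu %llu %llu",
               name, &active, &total, &objsize) != 4)
        return -1;
    if (strcmp(name, "buffer_head") != 0)
        return -1;

    /* Total objects times object size: memory pinned in the slab. */
    return (long long)(total * objsize);
}
```

Feeding each line of /proc/slabinfo through this and comparing the result
with LowFree in /proc/meminfo gives a quick view of how much lowmem the
buffer_heads account for.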

> BTW, none of these BIG customers will take a patch to figure out
> whats happening (since its on their production system) :(
> 

Yup.  What kernel(s) are they running?  I _think_ only suse have fixed that
problem.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-15 16:46       ` Badari Pulavarty
@ 2005-02-15 17:54         ` Andrew Morton
  2005-02-15 18:15           ` Badari Pulavarty
                             ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Andrew Morton @ 2005-02-15 17:54 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-fsdevel, ext2-devel

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> Yep. nobh_prepare_write() doesn't add any bufferheads. But
>  we call block_write_full_page() even for "nobh" case, which 
>  does create bufferheads, attaches to the page and operates
>  on them..

hmm, yeah, OK, we'll attach bh's in that case.  It's a rare case though -
when a dirty page falls off the end of the LRU.  There's no particular
reason why we cannot have a real mpage_writepage() which doesn't use bh's
and employ that.

I coulda sworn we used to have one.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-15 17:54         ` Andrew Morton
@ 2005-02-15 18:15           ` Badari Pulavarty
  2005-02-15 19:07           ` Nikita Danilov
  2005-02-16  0:02           ` [RFC] [PATCH] Generic mpage_writepage() support Badari Pulavarty
  2 siblings, 0 replies; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-15 18:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, ext2-devel

On Tue, 2005-02-15 at 09:54, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > Yep. nobh_prepare_write() doesn't add any bufferheads. But
> >  we call block_write_full_page() even for "nobh" case, which 
> >  does create bufferheads, attaches to the page and operates
> >  on them..
> 
> hmm, yeah, OK, we'll attach bh's in that case.  It's a rare case though -

Thanks. I just wanted to make sure I am not confused.

> when a dirty page falls off the end of the LRU.  There's no particular
> reason why we cannot have a real mpage_writepage() which doesn't use bh's
> and employ that.
> 
> I coulda sworn we used to have one.

Okay, I will cook it up. 

Thanks,
Badari


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-15 17:54         ` Andrew Morton
  2005-02-15 18:15           ` Badari Pulavarty
@ 2005-02-15 19:07           ` Nikita Danilov
  2005-02-15 19:39             ` Badari Pulavarty
  2005-02-16  0:02           ` [RFC] [PATCH] Generic mpage_writepage() support Badari Pulavarty
  2 siblings, 1 reply; 27+ messages in thread
From: Nikita Danilov @ 2005-02-15 19:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, ext2-devel, Badari Pulavarty

Andrew Morton <akpm@osdl.org> writes:

> Badari Pulavarty <pbadari@us.ibm.com> wrote:
>>
>> Yep. nobh_prepare_write() doesn't add any bufferheads. But
>>  we call block_write_full_page() even for "nobh" case, which 
>>  does create bufferheads, attaches to the page and operates
>>  on them..
>
> hmm, yeah, OK, we'll attach bh's in that case.  It's a rare case though -
> when a dirty page falls off the end of the LRU.  There's no particular

Maybe DB2 dirties pages through mmap?

Nikita.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-15 19:07           ` Nikita Danilov
@ 2005-02-15 19:39             ` Badari Pulavarty
  2005-02-15 20:00               ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-15 19:39 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Andrew Morton, linux-fsdevel, ext2-devel

On Tue, 2005-02-15 at 11:07, Nikita Danilov wrote:
> Andrew Morton <akpm@osdl.org> writes:
> 
> > Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >>
> >> Yep. nobh_prepare_write() doesn't add any bufferheads. But
> >>  we call block_write_full_page() even for "nobh" case, which 
> >>  does create bufferheads, attaches to the page and operates
> >>  on them..
> >
> > hmm, yeah, OK, we'll attach bh's in that case.  It's a rare case though -
> > when a dirty page falls off the end of the LRU.  There's no particular
> 
> Maybe DB2 dirties pages through mmap?
> 
> Nikita.

It does. Most databases dirty data using shared memory segments, but
they also write to the filesystem directly.

I am not looking at the DB2 issue here. I am looking at bufferhead usage
in general.

Thanks,
Badari



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Bufferheads & page-cache reference
  2005-02-15 19:39             ` Badari Pulavarty
@ 2005-02-15 20:00               ` Andrew Morton
  0 siblings, 0 replies; 27+ messages in thread
From: Andrew Morton @ 2005-02-15 20:00 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: nikita, linux-fsdevel, ext2-devel

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> > Maybe DB2 dirties pages through mmap?
> > 
> > Nikita.
> 
> It does. Most databases dirty data using shared memory segments, but
> they do write to filesystem directly also.

If DB2 leaves that dirty data floating about in memory for a long time,
it'll eventually get written back via block_write_full_page().  But if you
run fsync or msync, it'll go to disk via ->writepages() and no bh's will be
attached.
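
From userspace, the explicit-flush path looks roughly like the sketch below:
dirty a file through a shared mapping, then call msync() and fsync() rather
than leaving the dirty pages for the VM to write back later.
dirty_and_sync() is a hypothetical helper, and whether buffer_heads end up
attached still depends on the filesystem and mount options, as discussed
above.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Dirty `len` bytes of `path` through a shared mapping and flush them
 * explicitly. Returns 0 on success, -1 on any failure.
 */
int dirty_and_sync(const char *path, size_t len)
{
    assert(path != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memset(map, 'x', len);              /* dirty the pages via mmap */

    int rc = msync(map, len, MS_SYNC);  /* write the mapping back now */
    if (rc == 0)
        rc = fsync(fd);                 /* and flush the file itself */

    munmap(map, len);
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

A database that dirties pages through mmap could call something like this
at its own checkpoints instead of relying on memory pressure to push the
data out.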



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC] [PATCH] Generic mpage_writepage() support
  2005-02-15 17:54         ` Andrew Morton
  2005-02-15 18:15           ` Badari Pulavarty
  2005-02-15 19:07           ` Nikita Danilov
@ 2005-02-16  0:02           ` Badari Pulavarty
  2005-02-16 11:41             ` Nikita Danilov
  2 siblings, 1 reply; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-16  0:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, ext2-devel

[-- Attachment #1: Type: text/plain, Size: 825 bytes --]

On Tue, 2005-02-15 at 09:54, Andrew Morton wrote:
> Badari Pulavarty <pbadari@us.ibm.com> wrote:
> >
> > Yep. nobh_prepare_write() doesn't add any bufferheads. But
> >  we call block_write_full_page() even for "nobh" case, which 
> >  does create bufferheads, attaches to the page and operates
> >  on them..
> 
> hmm, yeah, OK, we'll attach bh's in that case.  It's a rare case though -
> when a dirty page falls off the end of the LRU.  There's no particular
> reason why we cannot have a real mpage_writepage() which doesn't use bh's
> and employ that.
> 
> I coulda sworn we used to have one.

Hi Andrew,

Here is my first version of the mpage_writepage() patch.
I haven't handled the "confused" case yet - I need to
pass a function pointer to handle it. This is just for
initial code review. I am still testing it.

Thanks,
Badari



[-- Attachment #2: mpage-writepage.patch --]
[-- Type: text/x-patch, Size: 4263 bytes --]

diff -Narup -X dontdiff linux-2.6.10/fs/ext2/inode.c linux-2.6.10.nobh/fs/ext2/inode.c
--- linux-2.6.10/fs/ext2/inode.c	2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10.nobh/fs/ext2/inode.c	2005-02-15 16:17:56.000000000 -0800
@@ -626,6 +626,12 @@ ext2_nobh_prepare_write(struct file *fil
 	return nobh_prepare_write(page,from,to,ext2_get_block);
 }
 
+static int ext2_nobh_writepage(struct page *page, 
+			struct writeback_control *wbc)
+{
+	return mpage_writepage(page, ext2_get_block, wbc);
+}
+
 static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,ext2_get_block);
@@ -675,7 +681,7 @@ struct address_space_operations ext2_aop
 struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
-	.writepage		= ext2_writepage,
+	.writepage		= ext2_nobh_writepage,
 	.sync_page		= block_sync_page,
 	.prepare_write		= ext2_nobh_prepare_write,
 	.commit_write		= nobh_commit_write,
diff -Narup -X dontdiff linux-2.6.10/fs/jfs/inode.c linux-2.6.10.nobh/fs/jfs/inode.c
--- linux-2.6.10/fs/jfs/inode.c	2004-12-24 13:33:48.000000000 -0800
+++ linux-2.6.10.nobh/fs/jfs/inode.c	2005-02-15 16:48:32.022885240 -0800
@@ -281,7 +281,7 @@ static int jfs_get_block(struct inode *i
 
 static int jfs_writepage(struct page *page, struct writeback_control *wbc)
 {
-	return block_write_full_page(page, jfs_get_block, wbc);
+	return mpage_writepage(page, jfs_get_block, wbc);
 }
 
 static int jfs_writepages(struct address_space *mapping,
diff -Narup -X dontdiff linux-2.6.10/fs/mpage.c linux-2.6.10.nobh/fs/mpage.c
--- linux-2.6.10/fs/mpage.c	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10.nobh/fs/mpage.c	2005-02-15 16:19:20.000000000 -0800
@@ -386,7 +386,7 @@ EXPORT_SYMBOL(mpage_readpage);
  * just allocate full-size (16-page) BIOs.
  */
 static struct bio *
-mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
+__mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
 	sector_t *last_block_in_bio, int *ret, struct writeback_control *wbc)
 {
 	struct address_space *mapping = page->mapping;
@@ -706,7 +706,7 @@ retry:
 							&mapping->flags);
 				}
 			} else {
-				bio = mpage_writepage(bio, page, get_block,
+				bio = __mpage_writepage(bio, page, get_block,
 						&last_block_in_bio, &ret, wbc);
 			}
 			if (ret || (--(wbc->nr_to_write) <= 0))
@@ -734,4 +734,59 @@ retry:
 		mpage_bio_submit(WRITE, bio);
 	return ret;
 }
+
+/*
+ * The generic ->writepage function for address_spaces
+ */
+int mpage_writepage(struct page *page, get_block_t *get_block,
+			struct writeback_control *wbc)
+{
+	struct inode * const inode = page->mapping->host;
+	loff_t i_size = i_size_read(inode);
+	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+	unsigned offset;
+	void *kaddr;
+	int ret = 0;
+	struct bio *bio = NULL;
+	sector_t last_block_in_bio = 0;
+
+	/* Is the page fully inside i_size? */
+	if (page->index < end_index) {
+		bio = __mpage_writepage(bio, page, get_block,
+				 &last_block_in_bio, &ret, wbc);
+		goto done;
+	}
+
+	/* Is the page fully outside i_size? (truncate in progress) */
+	offset = i_size & (PAGE_CACHE_SIZE-1);
+	if (page->index >= end_index+1 || !offset) {
+		/*
+		 * The page may have dirty, unmapped buffers.  For example,
+		 * they may have been added in ext3_writepage().  Make them
+		 * freeable here, so the page does not leak.
+		 */
+		block_invalidatepage(page, 0);
+		unlock_page(page);
+		return 0; /* don't care */
+	}
+
+	/*
+	 * The page straddles i_size.  It must be zeroed out on each and every
+	 * writepage invokation because it may be mmapped.  "A file is mapped
+	 * in multiples of the page size.  For a file that is not a multiple of
+	 * the  page size, the remaining memory is zeroed when mapped, and
+	 * writes to that region are not written out to the file."
+	 */
+	kaddr = kmap_atomic(page, KM_USER0);
+	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
+	flush_dcache_page(page);
+	kunmap_atomic(kaddr, KM_USER0);
+	bio = __mpage_writepage(bio, page, get_block, 
+			&last_block_in_bio, &ret, wbc);
+done:
+	if (bio)
+		mpage_bio_submit(WRITE, bio);
+	return ret;
+}
 EXPORT_SYMBOL(mpage_writepages);
+EXPORT_SYMBOL(mpage_writepage);
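
The i_size handling in the new mpage_writepage() above sorts a page into
one of three cases. A user-space sketch of that arithmetic, assuming 4 KiB
pages; every SK_ and sk_ name is hypothetical and only models the logic of
the patch, it is not kernel API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_PAGE_SHIFT 12                          /* assumption: 4 KiB pages */
#define SK_PAGE_SIZE  ((size_t)1 << SK_PAGE_SHIFT)

enum sk_page_pos { SK_INSIDE, SK_OUTSIDE, SK_STRADDLES };

/* Same tests as the patch: compare the page index against i_size. */
enum sk_page_pos sk_classify(uint64_t index, uint64_t i_size)
{
    uint64_t end_index = i_size >> SK_PAGE_SHIFT;
    uint64_t offset = i_size & (SK_PAGE_SIZE - 1);

    if (index < end_index)
        return SK_INSIDE;           /* fully inside i_size */
    if (index >= end_index + 1 || offset == 0)
        return SK_OUTSIDE;          /* fully outside (truncate in progress) */
    return SK_STRADDLES;            /* i_size falls inside this page */
}

/* Zero the part of a straddling page that lies beyond i_size. */
void sk_zero_tail(unsigned char *page, uint64_t i_size)
{
    size_t offset = (size_t)(i_size & (SK_PAGE_SIZE - 1));

    assert(offset != 0);            /* only meaningful for SK_STRADDLES */
    memset(page + offset, 0, SK_PAGE_SIZE - offset);
}
```

With i_size = 10000, pages 0 and 1 are inside, page 2 straddles (its tail
from byte 1808 onward is zeroed), and page 3 is outside.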


* Re: [RFC] [PATCH] Generic mpage_writepage() support
  2005-02-16  0:02           ` [RFC] [PATCH] Generic mpage_writepage() support Badari Pulavarty
@ 2005-02-16 11:41             ` Nikita Danilov
  2005-02-16 18:37               ` Badari Pulavarty
  0 siblings, 1 reply; 27+ messages in thread
From: Nikita Danilov @ 2005-02-16 11:41 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: linux-fsdevel, ext2-devel

Badari Pulavarty writes:
 > On Tue, 2005-02-15 at 09:54, Andrew Morton wrote:
 > > Badari Pulavarty <pbadari@us.ibm.com> wrote:
 > > >
 > > > Yep. nobh_prepare_write() doesn't add any bufferheads. But
 > > >  we call block_write_full_page() even for "nobh" case, which 
 > > >  does create bufferheads, attaches to the page and operates
 > > >  on them..
 > > 
 > > hmm, yeah, OK, we'll attach bh's in that case.  It's a rare case though -
 > > when a dirty page falls off the end of the LRU.  There's no particular
 > > reason why we cannot have a real mpage_writepage() which doesn't use bh's
 > > and employ that.
 > > 
 > > I coulda sworn we used to have one.
 > 
 > Hi Andrew,
 > 
 > Here is my first version of mpage_writepage() patch.
 > I haven't handle the "confused" case yet - I need to
 > pass a function pointer to handle it. Just for
 > initial code review. I am still testing it.
 > 
 > Thanks,
 > Badari
 > 
 > 
 > diff -Narup -X dontdiff linux-2.6.10/fs/ext2/inode.c linux-2.6.10.nobh/fs/ext2/inode.c
 > --- linux-2.6.10/fs/ext2/inode.c	2004-12-24 13:33:51.000000000 -0800

[...]

 >  	return ret;
 >  }
 > +
 > +/*
 > + * The generic ->writepage function for address_spaces
 > + */

This function doesn't look generic. It only works correctly with file
systems that store a pointer to the buffer head ring in page->private (at
least temporarily); otherwise the code after the page_has_buffers(page)
check in __mpage_writepage() will corrupt page->private.

Actually, this looks confusing. I thought that the main idea of mpage.c
was to get rid of buffer heads and switch everything to bios. But looking
at the current code, it seems that buffer heads are striking back: the code
simply assumes that PG_private means "buffers in page->private", making
mpage.c effectively useless for file systems that use page->private for
something else.
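
The ambiguity described above can be modelled in user space. In this
sketch (all sk_ and SK_ names are hypothetical stand-ins, not the kernel
definitions), the "has buffers" test looks only at a flag bit, so it also
returns true for a page whose private pointer holds something that is not
a buffer ring:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_PG_PRIVATE 0x1UL     /* stand-in for the PG_private flag bit */

/* Stand-in for a buffer head: buffers of one page form a circular ring. */
struct sk_buffer_head {
    struct sk_buffer_head *b_this_page;
};

struct sk_page {
    unsigned long flags;
    void *private;              /* meaning is defined by whoever set it */
};

/* Only the flag is examined; the payload type is never checked. */
static bool sk_page_has_buffers(const struct sk_page *page)
{
    return (page->flags & SK_PG_PRIVATE) != 0;
}

static void sk_set_private(struct sk_page *page, void *payload)
{
    page->flags |= SK_PG_PRIVATE;
    page->private = payload;
}

/* Walk the ring. Valid only if private really points at a buffer ring. */
static size_t sk_count_ring(const struct sk_buffer_head *head)
{
    const struct sk_buffer_head *bh = head;
    size_t n = 0;

    assert(head != NULL);
    do {
        n++;
        bh = bh->b_this_page;
    } while (bh != head);
    return n;
}
```

A page tagged with a filesystem-specific cookie passes
sk_page_has_buffers() just like a page with a real ring, so any caller
that then walks the ring would be reading the wrong type.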

There is another reason why mpage_writepage() is a problematic choice
for ->writepage: __mpage_writepage() calls
page->mapping->a_ops->writepage() in the "confused" case, which sounds like
infinite recursion.
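
One way to avoid that loop is the function-pointer approach mentioned in
the patch description: the low-level writer is handed an explicit fallback
instead of looking up ->writepage itself. A user-space sketch of the idea,
with hypothetical sk_ names and a boolean standing in for the "confused"
condition:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_page {
    bool confused;          /* stand-in for the "confused" condition */
    int direct_writes;      /* pages written by the bio-style path */
    int fallback_calls;     /* pages handed to the fallback */
};

typedef int (*sk_writepage_t)(struct sk_page *page);

/*
 * Low-level writer. On the confused path it calls the fallback supplied
 * by the caller, so it can never re-enter the entry point that called it.
 */
static int sk_low_level_write(struct sk_page *page, sk_writepage_t fallback)
{
    if (page->confused) {
        assert(fallback != NULL);
        return fallback(page);
    }
    page->direct_writes++;
    return 0;
}

/* Fallback that is allowed to use buffer heads. */
static int sk_bh_writepage(struct sk_page *page)
{
    page->fallback_calls++;
    return 0;
}

/* Entry point installed as ->writepage in this model. */
static int sk_nobh_writepage(struct sk_page *page)
{
    return sk_low_level_write(page, sk_bh_writepage);
}
```

If sk_low_level_write() instead called whatever ->writepage points to, and
that were sk_nobh_writepage() itself, a confused page would recurse
forever.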

[...]

 > +	if (page->index >= end_index+1 || !offset) {
 > +		/*
 > +		 * The page may have dirty, unmapped buffers.  For example,
 > +		 * they may have been added in ext3_writepage().  Make them
 > +		 * freeable here, so the page does not leak.
 > +		 */
 > +		block_invalidatepage(page, 0);

Shouldn't this be

            page->mapping->a_ops->invalidatepage(page, 0)

? To preserve external appearance of "genericity", that is. :)

 > +		unlock_page(page);
 > +		return 0; /* don't care */
 > +	}
 > +
 > +	/*
 > +	 * The page straddles i_size.  It must be zeroed out on each and every
 > +	 * writepage invokation because it may be mmapped.  "A file is mapped

Typo: should be invocation (at least beyond Australia).

Nikita.


* Re: [RFC] [PATCH] Generic mpage_writepage() support
  2005-02-16 11:41             ` Nikita Danilov
@ 2005-02-16 18:37               ` Badari Pulavarty
  2005-02-16 19:09                 ` Dave Kleikamp
  0 siblings, 1 reply; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-16 18:37 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: linux-fsdevel, ext2-devel

On Wed, 2005-02-16 at 03:41, Nikita Danilov wrote:
> Badari Pulavarty writes:
>  > On Tue, 2005-02-15 at 09:54, Andrew Morton wrote:
>  > > Badari Pulavarty <pbadari@us.ibm.com> wrote:
>  > > >
>  > > > Yep. nobh_prepare_write() doesn't add any bufferheads. But
>  > > >  we call block_write_full_page() even for "nobh" case, which 
>  > > >  does create bufferheads, attaches to the page and operates
>  > > >  on them..
>  > > 
>  > > hmm, yeah, OK, we'll attach bh's in that case.  It's a rare case though -
>  > > when a dirty page falls off the end of the LRU.  There's no particular
>  > > reason why we cannot have a real mpage_writepage() which doesn't use bh's
>  > > and employ that.
>  > > 
>  > > I coulda sworn we used to have one.
>  > 
>  > Hi Andrew,
>  > 
>  > Here is my first version of mpage_writepage() patch.
>  > I haven't handle the "confused" case yet - I need to
>  > pass a function pointer to handle it. Just for
>  > initial code review. I am still testing it.
>  > 
>  > Thanks,
>  > Badari
>  > 
>  > 
>  > diff -Narup -X dontdiff linux-2.6.10/fs/ext2/inode.c linux-2.6.10.nobh/fs/ext2/inode.c
>  > --- linux-2.6.10/fs/ext2/inode.c	2004-12-24 13:33:51.000000000 -0800
> 
> [...]
> 
>  >  	return ret;
>  >  }
>  > +
>  > +/*
>  > + * The generic ->writepage function for address_spaces
>  > + */
> 
> This function doesn't look generic. It only works correctly with file
> systems that store pointer to buffer head ring in page->private (at
> least temporarily), otherwise code after page_has_buffers(page) check in
> __mpage_writepage() will corrupt page->private.
> 
> Actually, this looks confusing. I thought that main idea of mpage.c is
> to get rid of buffer heads, and switch everything to bios. But looking
> at the current code it seems that buffer heads are striking back: code
> simply assumes that PG_private means "buffers in page->private", making
> mpage.c effectively useless for file systems using page->private for
> something else.

Let me put it this way: mpage.c doesn't create any new buffer heads,
but if some are already attached to a page, it can deal with them.

Yes, page->private is assumed to hold bufferheads. Do you really
need to handle page->private for non-bufferhead usage?

> 
> There is another reason why mpage_writepage() is a problematic choice
> for ->writepage: __mpage_writepage() calls
> page->mapping->a_ops->writepage() in "confused" case, which sounds like
> infinite recursion.

Yep. I need to fix that (as mentioned earlier).

> 
> [...]
> 
>  > +	if (page->index >= end_index+1 || !offset) {
>  > +		/*
>  > +		 * The page may have dirty, unmapped buffers.  For example,
>  > +		 * they may have been added in ext3_writepage().  Make them
>  > +		 * freeable here, so the page does not leak.
>  > +		 */
>  > +		block_invalidatepage(page, 0);
> 
> Shouldn't this be
> 
>             page->mapping->a_ops->invalidatepage(page, 0)
> 
> ? To preserve external appearance of "genericity", that is. :)

I think so.

Thanks,
Badari




* Re: [RFC] [PATCH] Generic mpage_writepage() support
  2005-02-16 18:37               ` Badari Pulavarty
@ 2005-02-16 19:09                 ` Dave Kleikamp
  2005-02-16 19:28                   ` Badari Pulavarty
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Kleikamp @ 2005-02-16 19:09 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Nikita Danilov, fsdevel, ext2-devel

On Wed, 2005-02-16 at 10:37 -0800, Badari Pulavarty wrote:

> Yes. page->private is assumed for the bufferhead usage. Do you really
> need for handling page->private for non-bufferhead usage ?

For what it's worth, I'm working on some changes to jfs that will use
page->private for non-bufferhead usage for metadata, but I won't be
using a generic writepage, so it's not an issue for me.

mpage.c already assumes page->private implies bufferheads, so it's not
completely generic.  Would implementing this as nobh_write_full_page, to
complement block_write_full_page, make sense?

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


* Re: [RFC] [PATCH] Generic mpage_writepage() support
  2005-02-16 19:09                 ` Dave Kleikamp
@ 2005-02-16 19:28                   ` Badari Pulavarty
  2005-02-16 19:43                     ` Dave Kleikamp
  0 siblings, 1 reply; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-16 19:28 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: Nikita Danilov, fsdevel, ext2-devel

On Wed, 2005-02-16 at 11:09, Dave Kleikamp wrote:
> On Wed, 2005-02-16 at 10:37 -0800, Badari Pulavarty wrote:
> 
> > Yes. page->private is assumed for the bufferhead usage. Do you really
> > need for handling page->private for non-bufferhead usage ?
> 
> For what it's worth, I'm working on some changes to jfs that will use
> page->private for non-bufferhead usage for metadata, but I won't be
> using a generic writepage, so it's not an issue for me.

Nope, it would be an issue for you, since jfs uses mpage_writepages(),
which shares the same code and treats page->private as bufferheads.

> 
> mpage.c already assumes page->private implies bufferheads, so it's not
> completely generic.  Would implementing this as nobh_write_full_page, to
> complement block_write_full_page, make sense?

I guess it can be done. So to really deal with this, we need to come
up with generic writepage/writepages interfaces which don't deal
with bufferheads.

Thanks,
Badari




* Re: [RFC] [PATCH] Generic mpage_writepage() support
  2005-02-16 19:28                   ` Badari Pulavarty
@ 2005-02-16 19:43                     ` Dave Kleikamp
  2005-02-16 21:38                       ` [Ext2-devel] " Badari Pulavarty
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Kleikamp @ 2005-02-16 19:43 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Nikita Danilov, fsdevel, ext2-devel

On Wed, 2005-02-16 at 11:28 -0800, Badari Pulavarty wrote:
> On Wed, 2005-02-16 at 11:09, Dave Kleikamp wrote:
> > On Wed, 2005-02-16 at 10:37 -0800, Badari Pulavarty wrote:
> > 
> > > Yes. page->private is assumed for the bufferhead usage. Do you really
> > > need for handling page->private for non-bufferhead usage ?
> > 
> > For what it's worth, I'm working on some changes to jfs that will use
> > page->private for non-bufferhead usage for metadata, but I won't be
> > using a generic writepage, so it's not an issue for me.
> 
> Nope. it would be an issue for you, since jfs uses mpage_writepages()
> which uses the same code - which thinks page->private as bufferhead.

The patch I am working on will call mpage_writepages() for metadata, but
will use my own writepage() rather than mpage_writepage(), and nothing
in mpage_writepages() will use page->private.

For normal data, page->private, if used at all, will be bufferheads.

> >
> > mpage.c already assumes page->private implies bufferheads, so it's not
> > completely generic.  Would implementing this as nobh_write_full_page, to
> > complement block_write_full_page, make sense?
> 
> I guess, it can be done. So to really deal with this, we need to come
> up with generic writepage/writepages interfaces which doesn't deal
> with bufferheads.

I'm not sure how useful that would be.  Are there any users of a
non-bufferhead page->private that want to call a generic writepage(s)?
In other words, if a generic function is sufficient, you probably
wouldn't be using page->private anyway.
-- 
David Kleikamp
IBM Linux Technology Center




* Re: [Ext2-devel] Re: [RFC] [PATCH] Generic mpage_writepage() support
  2005-02-16 19:43                     ` Dave Kleikamp
@ 2005-02-16 21:38                       ` Badari Pulavarty
  2005-02-16 21:46                         ` Dave Kleikamp
  0 siblings, 1 reply; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-16 21:38 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: Nikita Danilov, fsdevel, ext2-devel

On Wed, 2005-02-16 at 11:43, Dave Kleikamp wrote:
> On Wed, 2005-02-16 at 11:28 -0800, Badari Pulavarty wrote:
> > On Wed, 2005-02-16 at 11:09, Dave Kleikamp wrote:
> > > On Wed, 2005-02-16 at 10:37 -0800, Badari Pulavarty wrote:
> > > 
> > > > Yes. page->private is assumed for the bufferhead usage. Do you really
> > > > need for handling page->private for non-bufferhead usage ?
> > > 
> > > For what it's worth, I'm working on some changes to jfs that will use
> > > page->private for non-bufferhead usage for metadata, but I won't be
> > > using a generic writepage, so it's not an issue for me.
> > 
> > Nope. it would be an issue for you, since jfs uses mpage_writepages()
> > which uses the same code - which thinks page->private as bufferhead.
> 
> The patch I am working on will call mpage_writepages() for metadata, but
> will use my own writepage() rather than mpage_writepage(), and nothing
> in mpage_writepages() will use page->private.

Okay, that will work. Basically, you plan to call mpage_writepages()
with NULL as get_block(), right?

> For normal data, page->private, if used at all, will be bufferheads.

Good.

> 
> > >
> > > mpage.c already assumes page->private implies bufferheads, so it's not
> > > completely generic.  Would implementing this as nobh_write_full_page, to
> > > complement block_write_full_page, make sense?
> > 
> > I guess, it can be done. So to really deal with this, we need to come
> > up with generic writepage/writepages interfaces which doesn't deal
> > with bufferheads.
> 
> I'm not sure how useful that would be.  Are there any users of a
> non-bufferhead page->private that want to call a generic writepage(s)?
> In other words, if a generic function is sufficient, you probably
> wouldn't be using page->private anyway.

This goes back to the original question: is there a point in creating
new interfaces for writepage & writepages which don't assume that
page->private holds bufferheads? So far, the only users are ext2+nobh and
JFS. But neither of them will use page->private for anything other than
bufferheads (if it is used at all). For both of these, the mpage_writepage()
I cooked up earlier would be good enough.

I guess we will do it ONLY if someone really needs it.

Thanks,
Badari



* Re: [Ext2-devel] Re: [RFC] [PATCH] Generic mpage_writepage() support
  2005-02-16 21:38                       ` [Ext2-devel] " Badari Pulavarty
@ 2005-02-16 21:46                         ` Dave Kleikamp
  2005-02-17  0:13                           ` [RFC] [PATCH] nobh_write_page() support Badari Pulavarty
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Kleikamp @ 2005-02-16 21:46 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: Nikita Danilov, fsdevel, ext2-devel

On Wed, 2005-02-16 at 13:38 -0800, Badari Pulavarty wrote:
> On Wed, 2005-02-16 at 11:43, Dave Kleikamp wrote:
> > The patch I am working on will call mpage_writepages() for metadata, but
> > will use my own writepage() rather than mpage_writepage(), and nothing
> > in mpage_writepages() will use page->private.
> 
> Okay, that will work. Basically, you plan to call mpage_writepages()
> with NULL as get_block(). Isn't it ?

Yes
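
The dispatch being agreed on here can be modelled as below. This is a
user-space sketch under the assumption, taken from the exchange above,
that a NULL get_block makes the multi-page writer hand each page to the
filesystem's own writepage; all sk_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sk_page { int index; };

typedef int (*sk_get_block_t)(int index);
typedef int (*sk_writepage_t)(struct sk_page *page);

struct sk_stats {
    int via_writepage;      /* pages sent to the filesystem's writepage */
    int via_get_block;      /* pages mapped through get_block */
};

/* Example callbacks standing in for a filesystem's implementations. */
static int sk_fs_writepage(struct sk_page *page) { (void)page; return 0; }
static int sk_fs_get_block(int index) { (void)index; return 0; }

/*
 * Model of the multi-page writer: with get_block == NULL every page goes
 * to the supplied writepage, otherwise the get_block path is used.
 */
static int sk_writepages(struct sk_page *pages, size_t n,
                         sk_get_block_t get_block,
                         sk_writepage_t writepage,
                         struct sk_stats *st)
{
    for (size_t i = 0; i < n; i++) {
        int ret;

        if (get_block == NULL) {
            assert(writepage != NULL);
            ret = writepage(&pages[i]);
            st->via_writepage++;
        } else {
            ret = get_block(pages[i].index);
            st->via_get_block++;
        }
        if (ret)
            return ret;     /* stop at the first error */
    }
    return 0;
}
```

In this model a metadata mapping would pass NULL for get_block, so none of
its pages ever reach the path that interprets page->private.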

> > > > mpage.c already assumes page->private implies bufferheads, so it's not
> > > > completely generic.  Would implementing this as nobh_write_full_page, to
> > > > complement block_write_full_page, make sense?
> > > 
> > > I guess, it can be done. So to really deal with this, we need to come
> > > up with generic writepage/writepages interfaces which doesn't deal
> > > with bufferheads.
> > 
> > I'm not sure how useful that would be.  Are there any users of a
> > non-bufferhead page->private that want to call a generic writepage(s)?
> > In other words, if a generic function is sufficient, you probably
> > wouldn't be using page->private anyway.
> 
> This goes back to original question, is there a point in creating
> new interfaces for writepage & writepages which doesn't assuming
> page->private. So far, only users are ext2+nobh and JFS. But both
> of them will not use page->private for anything else other than
> bufferheads (if at all used). For both of these, the mpage_writepage()
> I cooked up earlier would be good enough.

I agree.  I just think that nobh_write_full_page() would be a more
consistent name than mpage_writepage().

> I guess, we will do it ONLY if someone really needs it. 

Agreed ... and I don't think anyone will need it.  :^)

> Thanks,
> Badari

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



* [RFC] [PATCH] nobh_write_page() support
  2005-02-16 21:46                         ` Dave Kleikamp
@ 2005-02-17  0:13                           ` Badari Pulavarty
  0 siblings, 0 replies; 27+ messages in thread
From: Badari Pulavarty @ 2005-02-17  0:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nikita Danilov, fsdevel, ext2-devel

[-- Attachment #1: Type: text/plain, Size: 457 bytes --]

Okay, in light of all the discussions we had, I reworked the
patch and decided to call it nobh_write_page() support (thanks to
Shaggy). This is for filesystems which don't need buffer
heads and do not use page->private for anything else.

Here is the patch, without handling of the "confused" case. Depending on
how folks view this, I would add support to avoid the infinite
recursion (in the "confused" case).

Andrew, what do you think ?

Thanks,
Badari






[-- Attachment #2: nobh_writepage.patch --]
[-- Type: text/x-patch, Size: 6596 bytes --]

diff -Narup -X dontdiff linux-2.6.10/fs/buffer.c linux-2.6.10.nobh/fs/buffer.c
--- linux-2.6.10/fs/buffer.c	2004-12-24 13:34:58.000000000 -0800
+++ linux-2.6.10.nobh/fs/buffer.c	2005-02-16 16:51:20.708492872 -0800
@@ -39,6 +39,7 @@
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
+#include <linux/mpage.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static void invalidate_bh_lrus(void);
@@ -2492,6 +2493,55 @@ int nobh_commit_write(struct file *file,
 EXPORT_SYMBOL(nobh_commit_write);
 
 /*
+ * nobh_write_page() - based on block_write_full_page() except
+ * that it tries to operate without attaching bufferheads to
+ * the page.
+ */
+int nobh_write_page(struct page *page, get_block_t *get_block,
+			struct writeback_control *wbc)
+{
+	struct inode * const inode = page->mapping->host;
+	loff_t i_size = i_size_read(inode);
+	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+	unsigned offset;
+	void *kaddr;
+
+	/* Is the page fully inside i_size? */
+	if (page->index < end_index) {
+		return mpage_writepage(page, get_block, wbc);
+	}
+
+	/* Is the page fully outside i_size? (truncate in progress) */
+	offset = i_size & (PAGE_CACHE_SIZE-1);
+	if (page->index >= end_index+1 || !offset) {
+		/*
+		 * The page may have dirty, unmapped buffers.  For example,
+		 * they may have been added in ext3_writepage().  Make them
+		 * freeable here, so the page does not leak.
+		 */
+#if 0
+		/* I am not really sure, if we need this */
+		do_invalidatepage(page, 0);
+#endif
+		unlock_page(page);
+		return 0; /* don't care */
+	}
+
+	/*
+	 * The page straddles i_size.  It must be zeroed out on each and every
+	 * writepage invocation because it may be mmapped.  "A file is mapped
+	 * in multiples of the page size.  For a file that is not a multiple of
+	 * the  page size, the remaining memory is zeroed when mapped, and
+	 * writes to that region are not written out to the file."
+	 */
+	kaddr = kmap_atomic(page, KM_USER0);
+	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
+	flush_dcache_page(page);
+	kunmap_atomic(kaddr, KM_USER0);
+	return mpage_writepage(page, get_block, wbc);
+}
+EXPORT_SYMBOL(nobh_write_page);
+
+/*
  * This function assumes that ->prepare_write() uses nobh_prepare_write().
  */
 int nobh_truncate_page(struct address_space *mapping, loff_t from)
diff -Narup -X dontdiff linux-2.6.10/fs/ext2/inode.c linux-2.6.10.nobh/fs/ext2/inode.c
--- linux-2.6.10/fs/ext2/inode.c	2004-12-24 13:33:51.000000000 -0800
+++ linux-2.6.10.nobh/fs/ext2/inode.c	2005-02-16 16:27:32.000000000 -0800
@@ -626,6 +626,12 @@ ext2_nobh_prepare_write(struct file *fil
 	return nobh_prepare_write(page,from,to,ext2_get_block);
 }
 
+static int ext2_nobh_writepage(struct page *page, 
+			struct writeback_control *wbc)
+{
+	return nobh_write_page(page, ext2_get_block, wbc);
+}
+
 static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
 {
 	return generic_block_bmap(mapping,block,ext2_get_block);
@@ -675,7 +681,7 @@ struct address_space_operations ext2_aop
 struct address_space_operations ext2_nobh_aops = {
 	.readpage		= ext2_readpage,
 	.readpages		= ext2_readpages,
-	.writepage		= ext2_writepage,
+	.writepage		= ext2_nobh_writepage,
 	.sync_page		= block_sync_page,
 	.prepare_write		= ext2_nobh_prepare_write,
 	.commit_write		= nobh_commit_write,
diff -Narup -X dontdiff linux-2.6.10/fs/jfs/inode.c linux-2.6.10.nobh/fs/jfs/inode.c
--- linux-2.6.10/fs/jfs/inode.c	2004-12-24 13:33:48.000000000 -0800
+++ linux-2.6.10.nobh/fs/jfs/inode.c	2005-02-16 16:27:42.000000000 -0800
@@ -281,7 +281,7 @@ static int jfs_get_block(struct inode *i
 
 static int jfs_writepage(struct page *page, struct writeback_control *wbc)
 {
-	return block_write_full_page(page, jfs_get_block, wbc);
+	return nobh_write_page(page, jfs_get_block, wbc);
 }
 
 static int jfs_writepages(struct address_space *mapping,
diff -Narup -X dontdiff linux-2.6.10/fs/mpage.c linux-2.6.10.nobh/fs/mpage.c
--- linux-2.6.10/fs/mpage.c	2004-12-24 13:34:26.000000000 -0800
+++ linux-2.6.10.nobh/fs/mpage.c	2005-02-16 16:23:19.000000000 -0800
@@ -386,7 +386,7 @@ EXPORT_SYMBOL(mpage_readpage);
  * just allocate full-size (16-page) BIOs.
  */
 static struct bio *
-mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
+__mpage_writepage(struct bio *bio, struct page *page, get_block_t get_block,
 	sector_t *last_block_in_bio, int *ret, struct writeback_control *wbc)
 {
 	struct address_space *mapping = page->mapping;
@@ -706,7 +706,7 @@ retry:
 							&mapping->flags);
 				}
 			} else {
-				bio = mpage_writepage(bio, page, get_block,
+				bio = __mpage_writepage(bio, page, get_block,
 						&last_block_in_bio, &ret, wbc);
 			}
 			if (ret || (--(wbc->nr_to_write) <= 0))
@@ -734,4 +734,21 @@ retry:
 		mpage_bio_submit(WRITE, bio);
 	return ret;
 }
+
+int
+mpage_writepage(struct page *page, get_block_t get_block,
+	struct writeback_control *wbc)
+{
+	int ret = 0;
+	struct bio *bio = NULL;
+	sector_t last_block_in_bio = 0;
+
+	bio = __mpage_writepage(bio, page, get_block,
+			&last_block_in_bio, &ret, wbc);
+	if (bio)
+		mpage_bio_submit(WRITE, bio);
+
+	return ret;
+}
+
 EXPORT_SYMBOL(mpage_writepages);
diff -Narup -X dontdiff linux-2.6.10/include/linux/buffer_head.h linux-2.6.10.nobh/include/linux/buffer_head.h
--- linux-2.6.10/include/linux/buffer_head.h	2004-12-24 13:33:49.000000000 -0800
+++ linux-2.6.10.nobh/include/linux/buffer_head.h	2005-02-16 16:22:51.000000000 -0800
@@ -203,6 +203,9 @@ int file_fsync(struct file *, struct den
 int nobh_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
 int nobh_commit_write(struct file *, struct page *, unsigned, unsigned);
 int nobh_truncate_page(struct address_space *, loff_t);
+int nobh_write_page(struct page *page, get_block_t *get_block,
+                        struct writeback_control *wbc);
+
 
 /*
  * inline definitions
diff -Narup -X dontdiff linux-2.6.10/include/linux/mpage.h linux-2.6.10.nobh/include/linux/mpage.h
--- linux-2.6.10/include/linux/mpage.h	2004-12-24 13:34:32.000000000 -0800
+++ linux-2.6.10.nobh/include/linux/mpage.h	2005-02-15 16:18:50.000000000 -0800
@@ -17,6 +17,8 @@ int mpage_readpages(struct address_space
 int mpage_readpage(struct page *page, get_block_t get_block);
 int mpage_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, get_block_t get_block);
+int mpage_writepage(struct page *page, get_block_t *get_block,
+		struct writeback_control *wbc);
 
 static inline int
 generic_writepages(struct address_space *mapping, struct writeback_control *wbc)


end of thread, other threads:[~2005-02-17  0:13 UTC | newest]

Thread overview: 27+ messages
2005-02-14 19:30 Bufferheads & page-cache reference Badari Pulavarty
2005-02-14 19:31 ` [Ext2-devel] " Sonny Rao
2005-02-14 21:40 ` Andrew Morton
2005-02-14 22:10   ` William Lee Irwin III
2005-02-14 22:31     ` Andrew Morton
2005-02-14 22:50       ` William Lee Irwin III
2005-02-15  0:22         ` Badari Pulavarty
2005-02-15  2:57           ` Andrew Morton
2005-02-15 16:03             ` Badari Pulavarty
2005-02-15 17:26               ` Andrew Morton
2005-02-15  1:27   ` Badari Pulavarty
2005-02-15  3:05     ` Andrew Morton
2005-02-15 16:46       ` Badari Pulavarty
2005-02-15 17:54         ` Andrew Morton
2005-02-15 18:15           ` Badari Pulavarty
2005-02-15 19:07           ` Nikita Danilov
2005-02-15 19:39             ` Badari Pulavarty
2005-02-15 20:00               ` Andrew Morton
2005-02-16  0:02           ` [RFC] [PATCH] Generic mpage_writepage() support Badari Pulavarty
2005-02-16 11:41             ` Nikita Danilov
2005-02-16 18:37               ` Badari Pulavarty
2005-02-16 19:09                 ` Dave Kleikamp
2005-02-16 19:28                   ` Badari Pulavarty
2005-02-16 19:43                     ` Dave Kleikamp
2005-02-16 21:38                       ` [Ext2-devel] " Badari Pulavarty
2005-02-16 21:46                         ` Dave Kleikamp
2005-02-17  0:13                           ` [RFC] [PATCH] nobh_write_page() support Badari Pulavarty
