OOM kills when running fsstress on CIFS

linux-fsdevel.vger.kernel.org archive mirror

* OOM kills when running fsstress on CIFS
@ 2010-05-25 10:57 Jeff Layton
  2010-05-25 11:16 ` Nick Piggin
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff Layton @ 2010-05-25 10:57 UTC
  To: Nick Piggin, Steve French, Dave Kleikamp; +Cc: linux-fsdevel, linux-cifs-client

Since 2.6.34, I've been able to consistently reproduce OOM kills when running fsstress (from the LTP suite) on CIFS. I spent some time yesterday and bisected it down to this patch:

---------------------[snip]---------------------
commit 315e995c63a15cb4d4efdbfd70fe2db191917f7a
Author: Nick Piggin <npiggin@suse.de>
Date:   Wed Apr 21 03:18:28 2010 +0000

    [CIFS] use add_to_page_cache_lru
    
    add_to_page_cache_lru is exported, so it should be used. Benefits over
    using a private pagevec: neater code, 128 bytes fewer stack used, percpu
    lru ordering is preserved, and finally don't need to flush pagevec
    before returning so batching may be shared with other LRU insertions.
    
    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Reviewed-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
    Signed-off-by: Steve French <sfrench@us.ibm.com>
---------------------[snip]---------------------

Here's how I've been reproducing it:

Mount up a samba share with -o sec=krb5i,nounix,noserverino

Run: fsstress -d /path/to/dir/on/cifs/ -n 1000 -l0 -p8

...within an hour or two, I start getting OOM kills. After backing out
the patch above, I was able to run the test overnight. I'm not sure yet
what the actual problem is, but there seems to be something wrong with
that patch.

Thoughts?
-- 
Jeff Layton <jlayton@redhat.com>

* Re: OOM kills when running fsstress on CIFS
  2010-05-25 10:57 OOM kills when running fsstress on CIFS Jeff Layton
@ 2010-05-25 11:16 ` Nick Piggin
  2010-05-25 11:49   ` Jeff Layton
  0 siblings, 1 reply; 5+ messages in thread
From: Nick Piggin @ 2010-05-25 11:16 UTC
  To: Jeff Layton; +Cc: Steve French, Dave Kleikamp, linux-cifs-client, linux-fsdevel

On Tue, May 25, 2010 at 06:57:05AM -0400, Jeff Layton wrote:
> Since 2.6.34, I've been able to consistently reproduce OOM kills when running fsstress (from the LTP suite) on CIFS. I spent some time yesterday and bisected it down to this patch:
> 
> ---------------------[snip]---------------------
> commit 315e995c63a15cb4d4efdbfd70fe2db191917f7a
> Author: Nick Piggin <npiggin@suse.de>
> Date:   Wed Apr 21 03:18:28 2010 +0000
> 
>     [CIFS] use add_to_page_cache_lru
>     
>     add_to_page_cache_lru is exported, so it should be used. Benefits over
>     using a private pagevec: neater code, 128 bytes fewer stack used, percpu
>     lru ordering is preserved, and finally don't need to flush pagevec
>     before returning so batching may be shared with other LRU insertions.
>     
>     Signed-off-by: Nick Piggin <npiggin@suse.de>
>     Reviewed-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
>     Signed-off-by: Steve French <sfrench@us.ibm.com>
> ---------------------[snip]---------------------
> 
> Here's how I've been reproducing it:
> 
> Mount up a samba share with -o sec=krb5i,nounix,noserverino
> 
> Run: fsstress -d /path/to/dir/on/cifs/ -n 1000 -l0 -p8
> 
> ...within an hour or two, I start getting OOM kills. After backing out
> the patch above, I was able to run the test overnight. I'm not sure yet
> what the actual problem is, but there seems to be something wrong with
> that patch.
> 
> Thoughts?

Yep, it's my fault. The problem is the refcounting. Previously the
code handed off its references to the LRU, whereas now the LRU takes
a new reference. (The other filesystems converted to use this
function seemed to more conventionally open-code lru_cache_add.)

Can we get rid of a refcount increment anywhere? Otherwise we'll
need to just drop the references after adding the pages.
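
The accounting described above can be sketched as a small user-space
model. This is a toy, not kernel code: struct toy_page, toy_get, toy_put
and refs_left_after_eviction are hypothetical names, and the model assumes
only what the thread states, namely that the old path handed the caller's
reference to the LRU while the new path takes an additional one.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page: only the reference count is modeled. */
struct toy_page {
    int refs;
};

static inline void toy_get(struct toy_page *p)
{
    p->refs++;
}

static inline void toy_put(struct toy_page *p)
{
    assert(p->refs > 0);    /* dropping a reference nobody holds is a bug */
    p->refs--;
}

/*
 * Simulate one page's life: the caller receives it holding one reference,
 * inserts it, and later the cache evicts it. Returns the references still
 * outstanding after eviction; anything above 0 means the page is never freed.
 */
int refs_left_after_eviction(bool lru_takes_new_ref, bool caller_releases)
{
    struct toy_page page = { .refs = 1 };   /* the caller's reference */

    if (lru_takes_new_ref)
        toy_get(&page);     /* insertion takes its own reference */
    /* otherwise the caller's reference is handed off, count unchanged */

    if (caller_releases)
        toy_put(&page);     /* caller drops its reference after the add */

    toy_put(&page);         /* eviction drops the cache's reference */
    return page.refs;
}
```

In this model, the hand-off scheme ends at zero references, the new
scheme without a release ends at one (a leaked page), and the new scheme
with a release after the add ends at zero again.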


* Re: OOM kills when running fsstress on CIFS
  2010-05-25 11:16 ` Nick Piggin
@ 2010-05-25 11:49   ` Jeff Layton
  2010-05-25 11:54     ` Nick Piggin
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff Layton @ 2010-05-25 11:49 UTC
  To: Nick Piggin; +Cc: Steve French, Dave Kleikamp, linux-cifs-client, linux-fsdevel

On Tue, 25 May 2010 21:16:39 +1000
Nick Piggin <npiggin@suse.de> wrote:

> Yep, it's my fault. The problem is the refcounting. Previously the
> code handed off its references to the LRU, whereas now the LRU takes
> a new reference. (The other filesystems converted to use this
> function seemed to more conventionally open-code lru_cache_add.)
> 
> Can we get rid of a refcount increment anywhere? Otherwise we'll
> need to just drop the references after adding the pages.
> 

The only caller of this function is cifs_readpages, and I don't see
where it takes any references. I'm guessing that the pages come from
the VFS with the refcount already incremented?

-- 
Jeff Layton <jlayton@redhat.com>

* Re: OOM kills when running fsstress on CIFS
  2010-05-25 11:49   ` Jeff Layton
@ 2010-05-25 11:54     ` Nick Piggin
  2010-05-25 16:32       ` Jeff Layton
  0 siblings, 1 reply; 5+ messages in thread
From: Nick Piggin @ 2010-05-25 11:54 UTC
  To: Jeff Layton; +Cc: Steve French, Dave Kleikamp, linux-cifs-client, linux-fsdevel

On Tue, May 25, 2010 at 07:49:29AM -0400, Jeff Layton wrote:
> The only caller of this function is cifs_readpages, and I don't see
> where it takes any references. I'm guessing that the pages come from
> the VFS with the refcount already incremented?

Yep. I think we should just page_cache_release after doing the
add_to_page_cache_lru, like the generic code does.

It's a little suboptimal to tinker with refcounts like that, but
a cleanup pass that fixes it up all over the tree and allows
add_to_page_cache_lru to take over a reference would be a better
place to fix that.
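
The suggested pattern (add the page, then drop the caller's reference)
might look like the following user-space sketch over a batch of pages. It
is an illustration only: struct sim_page, sim_add_to_cache, sim_release
and sim_readpages are hypothetical stand-ins, not the CIFS or VFS
functions, and the actual patch is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a page: only the reference count is modeled. */
struct sim_page {
    int refs;
};

/* Stand-in for an insertion helper that takes its own reference. */
void sim_add_to_cache(struct sim_page *p)
{
    p->refs++;
}

/* Stand-in for dropping one reference. */
void sim_release(struct sim_page *p)
{
    assert(p->refs > 0);
    p->refs--;
}

/*
 * Insert a batch of pages that each arrive holding one caller reference.
 * Returns how many pages end up with more than the single reference the
 * cache owns, i.e. pages whose caller reference was never dropped.
 */
size_t sim_readpages(struct sim_page *pages, size_t n, int release_after_add)
{
    size_t over = 0;

    for (size_t i = 0; i < n; i++) {
        sim_add_to_cache(&pages[i]);
        if (release_after_add)
            sim_release(&pages[i]);     /* drop the caller's reference */
        if (pages[i].refs > 1)
            over++;
    }
    return over;
}
```

With release_after_add set, every page is left with exactly one
reference, the one owned by the cache; without it, every page in the
batch keeps an extra reference.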


* Re: OOM kills when running fsstress on CIFS
  2010-05-25 11:54     ` Nick Piggin
@ 2010-05-25 16:32       ` Jeff Layton
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff Layton @ 2010-05-25 16:32 UTC
  To: Nick Piggin; +Cc: linux-fsdevel, linux-cifs-client

On Tue, 25 May 2010 21:54:31 +1000
Nick Piggin <npiggin@suse.de> wrote:

> Yep. I think we should just page_cache_release after doing the
> add_to_page_cache_lru, like the generic code does.
> 
> It's a little suboptimal to tinker with refcounts like that, but
> a cleanup pass that fixes it up all over the tree and allows
> add_to_page_cache_lru to take over a reference would be a better
> place to fix that.
> 

Ok, I tested out the patch that I just sent and it seems to have fixed
the problem. Look ok?

-- 
Jeff Layton <jlayton@redhat.com>
