From: Nick Piggin <npiggin@suse.de>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, jack@suse.cz
Subject: Re: [rfc][patch] fs: turn iprune_mutex into rwsem
Date: Sat, 15 Aug 2009 06:45:34 +0200	[thread overview]
Message-ID: <20090815044534.GB19195@wotan.suse.de> (raw)
In-Reply-To: <20090814155847.860dd23f.akpm@linux-foundation.org>

On Fri, Aug 14, 2009 at 03:58:47PM -0700, Andrew Morton wrote:
> On Fri, 14 Aug 2009 17:25:05 +0200
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > 
> > We have had a report of memory allocation hangs during DVD-RAM (UDF) writing.
> > 
> > Jan tracked the cause of this down to UDF inode reclaim blocking:
> > 
> > gnome-screens D ffff810006d1d598     0 20686      1
> >  ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800
> >  ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580
> >  ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0
> > Call Trace:
> >  [<ffffffff804477f3>] io_schedule+0x63/0xa5
> >  [<ffffffff802c2587>] sync_buffer+0x3b/0x3f
> >  [<ffffffff80447d2a>] __wait_on_bit+0x47/0x79
> >  [<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77
> >  [<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21
> >  [<ffffffff802c442a>] __bread+0x70/0x86
> >  [<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a
> >  [<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c
> >  [<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b
> >  [<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394
> >  [<ffffffff802bd205>] write_inode_now+0x7d/0xc4
> >  [<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53
> >  [<ffffffff802b39ae>] clear_inode+0xc2/0x11b
> >  [<ffffffff802b3ab1>] dispose_list+0x5b/0x102
> >  [<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213
> >  [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
> >  [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
> >  [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
> >  [<ffffffff802951fa>] alloc_page_vma+0x176/0x189
> >  [<ffffffff802822d8>] __do_fault+0x10c/0x417
> >  [<ffffffff80284232>] handle_mm_fault+0x466/0x940
> >  [<ffffffff8044b922>] do_page_fault+0x676/0xabf
> > 
> > Which blocks with the inode lock held, which then blocks other
> > reclaimers:
> > 
> > X             D ffff81009d47c400     0 17285  14831
> >  ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288
> >  ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400
> >  ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740
> > Call Trace:
> >  [<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9
> >  [<ffffffff80447e1a>] mutex_lock+0x1e/0x22
> >  [<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213
> >  [<ffffffff8027ede3>] shrink_slab+0xe3/0x158
> >  [<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
> >  [<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
> >  [<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6
> >  [<ffffffff80279ac0>] __get_free_pages+0xe/0x4d
> >  [<ffffffff802ae1b7>] __pollwait+0x5e/0xdf
> >  [<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73
> >  [<ffffffff802ad949>] do_select+0x308/0x506
> >  [<ffffffff802adced>] core_sys_select+0x1a6/0x254
> >  [<ffffffff802ae0b7>] sys_select+0xb5/0x157
> 
> That isn't a hang.  When the bread() completes, everything proceeds.

Yes, sorry, bad wording (from our bug report). It isn't a hang, just
really bad latency, and everything stops for a while.

 
> > Now I think the main problem is having the filesystem block (and do IO)
> > in inode reclaim. The problem is that this doesn't get accounted well
> > and penalizes a random allocator with a big latency spike caused by
> > work generated elsewhere.
> 
> Yes.  Why does UDF do all that stuff in ->clear_inode()?  Other
> filesystems have very simple, non-blocking, non-IO-doing
> ->clear_inode() implementations.  This sounds like a design problem
> within UDF.

Yep.
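
For illustration only (not from the thread): a rough sketch of the contrast
Andrew is drawing. A typical ->clear_inode() just tears down in-memory state,
while the UDF path in the trace above goes through write_inode_now() and so
ends up doing synchronous buffer IO from inside reclaim. "simple_fs" and its
per-inode allocation are hypothetical names; write_inode_now() and
udf_clear_inode() are the real symbols visible in the trace.

/*
 * Hedged sketch -- "simple_fs" is a made-up example, not a real filesystem.
 * It shows the usual, cheap shape of ->clear_inode() next to the expensive
 * shape UDF has here (per the trace: udf_clear_inode -> write_inode_now).
 */
#include <linux/fs.h>
#include <linux/slab.h>

/* Common pattern: free in-memory state only, never touch the disk. */
static void simple_fs_clear_inode(struct inode *inode)
{
	kfree(inode->i_private);	/* hypothetical per-inode allocation */
	inode->i_private = NULL;
}

/* Shape of the UDF path in the trace: synchronous writeback under reclaim. */
static void udf_like_clear_inode(struct inode *inode)
{
	/* ... drop preallocations etc. ... */
	write_inode_now(inode, 0);	/* blocks in __bread() waiting for IO */
}

The quoted proposal below is precisely about avoiding or deferring that
write-out rather than letting it happen under reclaim.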

 
> > I think the best idea would be to avoid this. By design if possible,
> > or by deferring the hard work to an asynchronous context. If the latter,
> > then the fs would probably want to throttle creation of new work based
> > on the size of the deferred-work queue, but let's not get into those details.
> > 
> > Anyway, another obvious thing we looked at is the iprune_mutex, which
> > is causing the cascading blocking. We could turn this into an rwsem to
> > improve concurrency. It is unreasonable to totally ban all potentially
> > slow or blocking operations in inode reclaim, so I think this is a cheap
> > way to get a small improvement.
> > 
> > This doesn't solve the whole problem of course. The process doing inode
> > reclaim will still take the latency hit, and concurrent processes may
> > end up contending on filesystem locks. So fs developers should keep
> > these problems in mind please (or discuss alternatives).
> > 
> > Jan points out this has the potential to uncover concurrency bugs in fs
> > code.
> > 
> > Comments?
> 
> I bet you found that nice comment over iprune_mutex to be useful, no?
> 
> That comment needs updating by this patch, btw.

Yes, will do.

 
> > -	mutex_unlock(&iprune_mutex);
> > +	up_write(&iprune_sem);
> 
> yup, the patch looks OK.  inode_lock protects the lists and each thread
> will make each inode ineligible for lookup by other threads, so it's
> hard to see how there could be races in the VFS code.

I guess dispose_list does a bit of stuff unserialised now. However, most
of the same stuff gets done in inode deletion as well, so I think it is
pretty low risk.
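
To make the change under discussion concrete, here is a minimal sketch of the
iprune_mutex -> iprune_sem conversion as fs/inode.c-style code. It assumes
(consistent with the up_write() in the quoted hunk) that invalidate_inodes()
keeps exclusive access while prune_icache() moves to a shared read lock, so
concurrent reclaimers no longer serialise on each other. The bodies are
abbreviated and this is not the literal patch.

/* Was: static DEFINE_MUTEX(iprune_mutex); */
static DECLARE_RWSEM(iprune_sem);

/* Memory reclaim path: shared lock, so several reclaimers can proceed. */
static void prune_icache(int nr_to_scan)
{
	LIST_HEAD(freeable);

	down_read(&iprune_sem);		/* was mutex_lock(&iprune_mutex) */
	spin_lock(&inode_lock);
	/* ... scan inode_unused, collect unreferenced inodes on 'freeable' ... */
	spin_unlock(&inode_lock);

	dispose_list(&freeable);	/* may now run concurrently; see caveat above */
	up_read(&iprune_sem);		/* was mutex_unlock(&iprune_mutex) */
}

/* Unmount path: still excludes every reclaimer, as the mutex did. */
int invalidate_inodes(struct super_block *sb)
{
	int busy = 0;
	LIST_HEAD(throw_away);

	down_write(&iprune_sem);
	spin_lock(&inode_lock);
	/* ... move sb's inodes onto 'throw_away', set 'busy' if some are in use ... */
	spin_unlock(&inode_lock);

	dispose_list(&throw_away);
	up_write(&iprune_sem);		/* the line in the quoted hunk */

	return busy;
}

Whether dispose_list() is safe to run concurrently under the read lock is
exactly the low-risk question raised just above.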

