From: Dave Chinner <david@fromorbit.com>
To: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
tux3@tux3.org
Subject: Re: [PATCH] Optimize wait_sb_inodes()
Date: Thu, 27 Jun 2013 09:11:43 +1000
Message-ID: <20130626231143.GC28426@dastard>
In-Reply-To: <87ehbpntuk.fsf@devron.myhome.or.jp>

On Wed, Jun 26, 2013 at 05:45:23PM +0900, OGAWA Hirofumi wrote:
> Hi,
>
> Under the following stress workload, sync(2) became the top CPU load:
>
> fsstress -s 1 -n 50000 -p 3 -d `pwd`
>
> Profiling with perf showed that the CPU load was lock contention on inode_sb_list_lock.
>
> sync(2) is a data integrity operation, so it has to make sure all dirty
> data is written out as of the sync(2) point. The bdi flusher flushes the
> current dirty data and waits for it. But if there is I/O already in
> flight before sync_inodes_sb() runs, sync_inodes_sb() can complete
> before that prior in-flight I/O does.
>
> So wait_sb_inodes() walks all the inodes on the sb to make sure that
> in-flight I/O has completed as well. While it is walking the inodes,
> inode_sb_list_lock contends with other operations such as create(2).
> This is the cause of the CPU load.

Sure, but avoiding wait_sb_inodes() doesn't address the fast-path
lock contention that inode_sb_list_lock has in the create and evict
paths. The problems caused by wait_sb_inodes() are background noise
compared to the contention multiple CPUs can put on this lock just by
walking large directory structures in parallel. Hence hacking around
wait_sb_inodes() doesn't address the underlying lock contention
problem.
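
For reference, the walk being discussed looks roughly like the sketch
below. This is a simplified paraphrase of the fs/fs-writeback.c code of
that era, not a verbatim copy:

static void wait_sb_inodes(struct super_block *sb)
{
        struct inode *inode, *old_inode = NULL;

        spin_lock(&inode_sb_list_lock);
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                struct address_space *mapping = inode->i_mapping;

                spin_lock(&inode->i_lock);
                if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) ||
                    mapping->nrpages == 0) {
                        spin_unlock(&inode->i_lock);
                        continue;
                }
                __iget(inode);
                spin_unlock(&inode->i_lock);

                /*
                 * The list lock is dropped while waiting, but every
                 * iteration has to retake it, which is where the
                 * contention with the create/evict paths comes from.
                 */
                spin_unlock(&inode_sb_list_lock);

                iput(old_inode);
                old_inode = inode;

                filemap_fdatawait(mapping);
                cond_resched();

                spin_lock(&inode_sb_list_lock);
        }
        spin_unlock(&inode_sb_list_lock);
        iput(old_inode);
}
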
> From another point of view, wait_sb_inodes() would (arguably) be
> necessary for legacy FSes. But, for example, with data=journal on ext*,
> wait_sb_inodes() would be worse than useless, because ext* can do this
> itself via its own transaction list (and in a more efficient way).
>
> Likewise, on tux3 the situation is the same as with data=journal.
>
> Also, even with data=ordered, ext* might be able to check for in-flight
> I/O via its ordered data list (with some new additional check; I'm not
> sure).

Why would you bother solving this problem differently in every
single filesystem? It's solvable at the VFS level by tracking inodes
that are no longer dirty but still under writeback on the BDI.
Converting wait_sb_inodes() to walk only the dirty and writeback
inodes would then be sufficient for data integrity purposes, and it
would be done under the bdi writeback lock, not the
inode_sb_list_lock....
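
A minimal sketch of that idea follows, assuming a hypothetical per-bdi
b_writeback list of inodes that are clean but still have pages under
writeback, linked through the inode's i_wb_list and protected by the
existing wb->list_lock. The b_writeback list and the helper name are
made up for illustration; this is not code in the tree:

/*
 * Sketch only: writeback would move an inode whose last dirty page has
 * gone to writeback onto the hypothetical wb->b_writeback list, and
 * remove it again when the last page completes, so sync only walks
 * dirty + writeback inodes instead of every inode on the sb.
 */
static void wait_sb_inodes_sketch(struct super_block *sb)
{
        struct bdi_writeback *wb = &sb->s_bdi->wb;
        struct inode *inode, *old_inode = NULL;

        spin_lock(&wb->list_lock);
        list_for_each_entry(inode, &wb->b_writeback, i_wb_list) {
                if (inode->i_sb != sb)
                        continue;

                spin_lock(&inode->i_lock);
                __iget(inode);
                spin_unlock(&inode->i_lock);
                spin_unlock(&wb->list_lock);

                iput(old_inode);
                old_inode = inode;

                /* Wait for the in-flight pages on this inode only. */
                filemap_fdatawait(inode->i_mapping);
                cond_resched();

                spin_lock(&wb->list_lock);
        }
        spin_unlock(&wb->list_lock);
        iput(old_inode);
}
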
Alternatively, splitting up the inode sb list and lock (say via the
per-node list_lru structures in -mm and -next that are being added
for exactly this purpose) would also significantly reduce lock
contention on both the create/evict fast paths and the
wait_sb_inodes() walk that is currently done....
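
As a rough sketch of that alternative, the list-add side and the
wait_sb_inodes() walk might look something like the following, assuming
the list_lru interface from that series (list_lru_add(),
list_lru_walk()) and a hypothetical per-sb s_inodes_lru field; the
field name is invented here and the exact list_lru signatures may
differ from what finally lands:

/* Sketch: a per-node list_lru instead of one global list and lock. */
static void inode_sb_list_add_sketch(struct inode *inode)
{
        /* list_lru_add() picks the per-node sublist and lock itself. */
        list_lru_add(&inode->i_sb->s_inodes_lru, &inode->i_sb_list);
}

static enum lru_status wait_one_inode(struct list_head *item,
                                      spinlock_t *lock, void *arg)
{
        /*
         * Waiting would still need to drop 'lock', just as the current
         * wait_sb_inodes() drops inode_sb_list_lock around
         * filemap_fdatawait(); that part is glossed over here.
         */
        return LRU_SKIP;
}

static void wait_sb_inodes_sketch(struct super_block *sb)
{
        /* Each node's sublist is walked under its own lock. */
        list_lru_walk(&sb->s_inodes_lru, wait_one_inode, NULL, ULONG_MAX);
}
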
So I think that you should address the problem properly at the VFS
level so everyone benefits, not push interfaces that allow
filesystem-specific hacks to work around VFS-level deficiencies...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com