From: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org, tux3@tux3.org,
linux-kernel@vger.kernel.org
Subject: [PATCH] Optimize wait_sb_inodes()
Date: Wed, 26 Jun 2013 17:45:23 +0900
Message-ID: <87ehbpntuk.fsf@devron.myhome.or.jp>
Hi,
Under the following stress test, sync(2) rose to the top of the CPU load:
fsstress -s 1 -n 50000 -p 3 -d `pwd`
Profiling with perf showed that the CPU load was lock contention on inode_sb_list_lock.
sync(2) is a data integrity operation, so it has to make sure that all dirty
data has been written out by the time sync(2) returns. The bdi flusher flushes
the current dirty data and waits for it. However, if I/O was already in flight
before sync_inodes_sb() started, sync_inodes_sb() can complete before that
earlier in-flight I/O does. So wait_sb_inodes() walks all the inodes on the sb
to make sure the in-flight I/O has completed too. While it walks the inodes,
inode_sb_list_lock contends with other operations such as
create(2). This is the cause of the CPU load.
From another point of view, wait_sb_inodes() is (arguably) necessary for
legacy FSes. But, for example, with data=journal on ext*, wait_sb_inodes()
is worse than useless, because ext* can do the same job with its own
transaction list (and in a more efficient way).
Likewise, on tux3 the situation is the same as with data=journal.
Also, even with data=ordered, ext* might be able to check for in-flight I/O
via the ordered data list (with some new additional check; I'm not sure).
So this patch adds an sb->s_op->wait_inodes() handler to replace
wait_sb_inodes(). With this, FSes can optimize the wait by using their own
internal knowledge.
[An alternative idea to optimize this: push wait_sb_inodes() down into the
->sync_fs() handler of all FSes. Or if someone fixes this with another
idea, I'm happy enough.]
The following is a profile of the result on tux3 (the ->wait_inodes() handler
is a no-op function).
Thanks.
- 13.11% fsstress [kernel.kallsyms] [k] sync_inodes_sb
- sync_inodes_sb
- 99.97% sync_inodes_one_sb
iterate_supers
sys_sync
system_call
sync
- 9.39% fsstress [kernel.kallsyms] [k] _raw_spin_lock
- _raw_spin_lock
+ 62.72% sync_inodes_sb
+ 7.46% _atomic_dec_and_lock
+ 6.15% inode_add_lru
+ 4.43% map_region
+ 3.07% __find_get_buffer
+ 2.71% sync_inodes_one_sb
+ 1.77% tux3_set_buffer_dirty_list
+ 1.43% find_inode
+ 1.02% iput
+ 0.69% dput
+ 0.57% __tux3_get_block
+ 0.53% shrink_dcache_parent
+ 0.53% __d_instantiate
Before patch:
real 2m40.994s
user 0m1.344s
sys 0m15.832s
After patch:
real 2m34.748s
user 0m1.360s
sys 0m5.356s
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
---
fs/fs-writeback.c | 5 ++++-
include/linux/fs.h | 1 +
2 files changed, 5 insertions(+), 1 deletion(-)
diff -puN include/linux/fs.h~tux3-fix-wait_sb_inodes include/linux/fs.h
--- tux3fs/include/linux/fs.h~tux3-fix-wait_sb_inodes 2013-06-24 19:03:10.000000000 +0900
+++ tux3fs-hirofumi/include/linux/fs.h 2013-06-24 19:03:10.000000000 +0900
@@ -1595,6 +1595,7 @@ struct super_operations {
 	int (*write_inode) (struct inode *, struct writeback_control *wbc);
 	int (*drop_inode) (struct inode *);
 	void (*evict_inode) (struct inode *);
+	void (*wait_inodes)(struct super_block *);
 	void (*put_super) (struct super_block *);
 	int (*sync_fs)(struct super_block *sb, int wait);
 	int (*freeze_fs) (struct super_block *);
diff -puN fs/fs-writeback.c~tux3-fix-wait_sb_inodes fs/fs-writeback.c
--- tux3fs/fs/fs-writeback.c~tux3-fix-wait_sb_inodes 2013-06-24 19:03:10.000000000 +0900
+++ tux3fs-hirofumi/fs/fs-writeback.c 2013-06-24 19:03:10.000000000 +0900
@@ -1401,7 +1401,10 @@ void sync_inodes_sb(struct super_block *
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
 
-	wait_sb_inodes(sb);
+	if (sb->s_op->wait_inodes)
+		sb->s_op->wait_inodes(sb);
+	else
+		wait_sb_inodes(sb);
 }
 EXPORT_SYMBOL(sync_inodes_sb);
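For an FS whose ->sync_fs() already guarantees that data I/O has completed
(as claimed above for tux3 and for ext* with data=journal), the handler could
be wired up roughly like this. This is a hypothetical sketch, not part of the
patch; the example_* names are illustrative:

```c
/* Sketch only: an FS whose transaction machinery already waits for
 * all data I/O during ->sync_fs(sb, 1) can make the handler a no-op. */
static void example_wait_inodes(struct super_block *sb)
{
	/* nothing to do: ->sync_fs() has already waited */
}

static const struct super_operations example_sops = {
	/* other methods elided */
	.sync_fs     = example_sync_fs,
	.wait_inodes = example_wait_inodes,
};
```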
_
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>