Re: [PATCH] Optimize wait_sb_inodes()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-fsdevel@vger.kernel.org, tux3@tux3.org,
	Al Viro <viro@zeniv.linux.org.uk>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Optimize wait_sb_inodes()
Date: Thu, 27 Jun 2013 16:22:04 +0900	[thread overview]
Message-ID: <87ppv83tnn.fsf@devron.myhome.or.jp> (raw)
In-Reply-To: <20130627063816.GD29790@dastard> (Dave Chinner's message of "Thu, 27 Jun 2013 16:38:16 +1000")

Dave Chinner <david@fromorbit.com> writes:

>> Otherwise, vfs can't know the data is whether after sync point or before
>> sync point, and have to wait or not. FS is using the behavior like
>> data=journal has tracking of those already, and can reuse it.
>
> The VFS writeback code already differentiates between data written
> during a sync operation and that dirtied after a sync operation.
> Perhaps you should look at the tagging for WB_SYNC_ALL writeback
> that write_cache_pages does....
>
> But, anyway, we don't have to do that on the waiting side of things.
> All we need to do is add the inode to a "under IO" list on the bdi
> when the mapping is initially tagged with pages under writeback,
> and remove it from that list during IO completion when the mapping
> is no longer tagged as having pages under writeback.
>
> wait_sb_inodes() just needs to walk that list and wait on each inode
> to complete IO. It doesn't require *any awareness* of the underlying
> filesystem implementation or how the IO is actually issued - if
> there's IO in progress at the time wait_sb_inodes() is called, then
> it waits for it.
>
>> > Fix the root cause of the problem - the sub-optimal VFS code.
>> > Hacking around it specifically for out-of-tree code is not the way
>> > things get done around here...
>> 
>> I'm thinking the root cause is vfs can't have knowledge of FS internal,
>> e.g. FS is handling data transactional way, or not.
>
> If the filesystem has transactional data/metadata that the VFS is
> not tracking, then that is what the ->sync_fs call is for. i.e. so
> the filesystem can then do what ever extra writeback/waiting it
> needs to do that the VFS is unaware of.
>
> We already cater for what Tux3 needs in the VFS - all you've done is
> found an inefficient algorithm that needs fixing.

write_cache_pages() is library function to be called from per-FS. So, it
is not under vfs control can be assume already. And it doesn't do right
things via write_cache_pages() for data=journal, because it handles for
each inodes, not at once. So, new dirty data can be inserted while
marking.

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

WARNING: multiple messages have this Message-ID (diff)

From: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
To: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	tux3@tux3.org
Subject: Re: [PATCH] Optimize wait_sb_inodes()
Date: Thu, 27 Jun 2013 16:22:04 +0900	[thread overview]
Message-ID: <87ppv83tnn.fsf@devron.myhome.or.jp> (raw)
In-Reply-To: <20130627063816.GD29790@dastard> (Dave Chinner's message of "Thu, 27 Jun 2013 16:38:16 +1000")

Dave Chinner <david@fromorbit.com> writes:

>> Otherwise, vfs can't know the data is whether after sync point or before
>> sync point, and have to wait or not. FS is using the behavior like
>> data=journal has tracking of those already, and can reuse it.
>
> The VFS writeback code already differentiates between data written
> during a sync operation and that dirtied after a sync operation.
> Perhaps you should look at the tagging for WB_SYNC_ALL writeback
> that write_cache_pages does....
>
> But, anyway, we don't have to do that on the waiting side of things.
> All we need to do is add the inode to a "under IO" list on the bdi
> when the mapping is initially tagged with pages under writeback,
> and remove it from that list during IO completion when the mapping
> is no longer tagged as having pages under writeback.
>
> wait_sb_inodes() just needs to walk that list and wait on each inode
> to complete IO. It doesn't require *any awareness* of the underlying
> filesystem implementation or how the IO is actually issued - if
> there's IO in progress at the time wait_sb_inodes() is called, then
> it waits for it.
>
>> > Fix the root cause of the problem - the sub-optimal VFS code.
>> > Hacking around it specifically for out-of-tree code is not the way
>> > things get done around here...
>> 
>> I'm thinking the root cause is vfs can't have knowledge of FS internal,
>> e.g. FS is handling data transactional way, or not.
>
> If the filesystem has transactional data/metadata that the VFS is
> not tracking, then that is what the ->sync_fs call is for. i.e. so
> the filesystem can then do what ever extra writeback/waiting it
> needs to do that the VFS is unaware of.
>
> We already cater for what Tux3 needs in the VFS - all you've done is
> found an inefficient algorithm that needs fixing.

write_cache_pages() is library function to be called from per-FS. So, it
is not under vfs control can be assume already. And it doesn't do right
things via write_cache_pages() for data=journal, because it handles for
each inodes, not at once. So, new dirty data can be inserted while
marking.

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

next prev parent reply	other threads:[~2013-06-27  7:22 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-06-26  8:45 [PATCH] Optimize wait_sb_inodes() OGAWA Hirofumi
2013-06-26  8:45 ` OGAWA Hirofumi
2013-06-26 14:32 ` Jörn Engel
2013-06-27  0:01   ` OGAWA Hirofumi
2013-06-27  0:01     ` OGAWA Hirofumi
2013-06-26 23:11 ` Dave Chinner
2013-06-27  0:14   ` OGAWA Hirofumi
2013-06-27  0:14     ` OGAWA Hirofumi
2013-06-27  4:47     ` Dave Chinner
2013-06-27  5:18       ` OGAWA Hirofumi
2013-06-27  5:18         ` OGAWA Hirofumi
2013-06-27  6:38         ` Dave Chinner
2013-06-27  7:22           ` OGAWA Hirofumi [this message]
2013-06-27  7:22             ` OGAWA Hirofumi
2013-06-27  9:40             ` Dave Chinner
2013-06-27 10:06               ` OGAWA Hirofumi
2013-06-27 10:06                 ` OGAWA Hirofumi
2013-06-27 18:36                 ` Theodore Ts'o
2013-06-27 23:37                   ` OGAWA Hirofumi
2013-06-27 23:37                     ` OGAWA Hirofumi
2013-06-27 23:45                     ` Theodore Ts'o
2013-06-28  0:30                       ` OGAWA Hirofumi
2013-06-28  0:30                         ` OGAWA Hirofumi
2013-06-28  5:23                         ` Dave Chinner
2013-06-28  7:39                           ` OGAWA Hirofumi
2013-06-28  7:39                             ` OGAWA Hirofumi
2013-06-28  3:00                   ` Daniel Phillips
2013-06-27  7:22         ` Daniel Phillips
2013-06-27  5:50       ` Daniel Phillips
2013-06-27  5:50         ` Daniel Phillips

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ppv83tnn.fsf@devron.myhome.or.jp \
    --to=hirofumi@mail.parknet.co.jp \
    --cc=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tux3@tux3.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.