From mboxrd@z Thu Jan  1 00:00:00 1970
From: Curt Wohlgemuth <curtw@google.com>
Subject: Re: [PATCH] writeback: Don't wait for completion in writeback_inodes_sb_nr
Date: Mon, 11 Jul 2011 10:11:17 -0700
Message-ID: <CAO81RMairX_L5qGOcKKpw7gWdcTd2Y+5NL=+BA9EwTCNLCNmLw@mail.gmail.com>
References: <1309304616-8657-1-git-send-email-curtw@google.com>
	<20110629005422.GQ32466@dastard>
	<BANLkTimKtwVZo3D+w9AB2-C83d08YROcMw@mail.gmail.com>
	<20110629081155.GA5558@infradead.org>
	<20110629165714.GF17590@quack.suse.cz>
	<20110629175534.GA32236@infradead.org>
	<20110629191518.GA23196@quack.suse.cz>
	<BANLkTinndb4oHi6kNQrTsL86UrKDB=2rK6-gA8FM1Nj8CKYZ+w@mail.gmail.com>
	<20110711170042.GE5482@quack.suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Christoph Hellwig <hch@infradead.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	fengguang.wu@intel.com
To: Jan Kara <jack@suse.cz>
Return-path: <linux-kernel-owner@vger.kernel.org>
In-Reply-To: <20110711170042.GE5482@quack.suse.cz>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

Hi Jan:

On Mon, Jul 11, 2011 at 10:00 AM, Jan Kara <jack@suse.cz> wrote:
> =A0Hi Curt,
>
> On Fri 01-07-11 15:55:33, Curt Wohlgemuth wrote:
>> On Wed, Jun 29, 2011 at 12:15 PM, Jan Kara <jack@suse.cz> wrote:
>> > On Wed 29-06-11 13:55:34, Christoph Hellwig wrote:
>> >> On Wed, Jun 29, 2011 at 06:57:14PM +0200, Jan Kara wrote:
>> >> > > For sys_sync I'm pretty sure we could simply remove the
>> >> > > writeback_inodes_sb call and get just as good if not better p=
erformance,
>> >> > =A0 Actually, it won't with current code. Because WB_SYNC_ALL w=
riteback
>> >> > currently has the peculiarity that it looks like:
>> >> > =A0 for all inodes {
>> >> > =A0 =A0 write all inode data
>> >> > =A0 =A0 wait for inode data
>> >> > =A0 }
>> >> > while to achieve good performance we actually need something li=
ke
>> >> > =A0 for all inodes
>> >> > =A0 =A0 write all inode data
>> >> > =A0 for all inodes
>> >> > =A0 =A0 wait for inode data
>> >> > It makes a difference in an order of magnitude when there are l=
ots of
>> >> > smallish files - SLES had a bug like this so I know from user r=
eports ;)
>> >>
>> >> I don't think that's true. =A0The WB_SYNC_ALL writeback is done u=
sing
>> >> sync_inodes_sb, which operates as:
>> >>
>> >> =A0 for all dirty inodes in bdi:
>> >> =A0 =A0 =A0if inode belongs to sb
>> >> =A0 =A0 =A0 =A0 write all inode data
>> >>
>> >> =A0 for all inodes in sb:
>> >> =A0 =A0 =A0wait for inode data
>> >>
>> >> we still do that in a big for each sb loop, though.
>> > =A0True but writeback_single_inode() has in it:
>> > =A0 =A0 =A0 =A0if (wbc->sync_mode =3D=3D WB_SYNC_ALL) {
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0int err =3D filemap_fdatawait(mappi=
ng);
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (ret =3D=3D 0)
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0ret =3D err;
>> > =A0 =A0 =A0 =A0}
>> > =A0So we end up waiting much earlier. Probably we should remove th=
is wait
>> > but that will need some audit I guess.
>>
>> So today for WB_SYNC_ALL from sync_inodes_sb(), we do:
>> =A0 - queue a work item; this will
>> =A0 =A0 - write all dirty inodes in a sb
>> =A0 =A0 =A0 - write one inode's pages
>> =A0 =A0 =A0 - wait on all inode's pages
>> =A0 - wait for the work item
>> =A0 - wait on all inodes in the sb (wait_sb_inodes)
>>
>> I guess the point of wait_sb_inodes() is to wait on all inodes that
>> were written in a previous writeback pass.
> =A0Yes. As we discussed with Christoph, waiting for each inode could =
be
> avoided in principle but we have to be careful not to break data
> consistency guarantees in some other path calling
> writeback_single_inode()...
>
>> One other issue I have with sync as it's structured is that we don't
>> do a WB_SYNC_ALL pass on any inode that's only associated with a blo=
ck
>> device, and not on a mounted filesystem. =A0Blockdev mounts are
>> pseudo-mounts, and are explicitly skipped in __sync_filesystem(). =A0=
So
>> if you've written directly to a block device and do a sync, the only
>> pass over the pages for this inode are via the
>> wakeup_flusher_threads() -- which operates on a BDI, regardless of t=
he
>> superblock, and uses WB_SYNC_NONE.
>>
>> All the sync_filesystem() calls are per-sb, not per-BDI, and they'll
>> exclude pseudo-superblocks.
>>
>> I've seen cases in our modified kernels here at Google in which
>> lilo/shutdown failed because of a lack of WB_SYNC_ALL writeback for
>> /dev/sda (though I haven't been able to come up with a consistent te=
st
>> case, nor reproduce this on an upstream kernel).
> =A0Ho hum, I wonder what would be the cleanest way to fix this. I gue=
ss we
> could make an exception for 'bdev' superblock in the code but it's ug=
ly.

Yes, it's ugly.

You could declare that sync(2) is just doing it's job, that it's not
designed to write dirty pages from devices that don't have filesystems
mounted on them; that's what Christoph seems to be saying, and it's
what the man pages for sync(2) that I've seen say as well.  But
everybody I've talked to about this is surprised, so you could declare
it a bug as well :-) .

It seems to me that sys_sync *could* just dispense with iterating over
the superblocks, and just iterate over the BDIs, just as
wakeup_flusher_threads() does.  I.e., just do two passes over all BDIs
-- one with WB_SYNC_NONE, and one with WB_SYNC_ALL.  Well, not quite.
It would still have to go to each SB and call the quota_sync and
sync_fs callbacks, but those should be cheap.

And yes, this would make sys_sync and sys_syncfs more different than
they are today.

Thanks,
Curt

> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0=
 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>