public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org,
	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
	Wu Fengguang <fengguang.wu@intel.com>
Subject: Re: [PATCH 14/14] writeback: Per-sb dirty tracking
Date: Mon, 11 Aug 2014 09:16:51 +1000	[thread overview]
Message-ID: <20140810231651.GO20518@dastard> (raw)
In-Reply-To: <20140808104619.GD2398@quack.suse.cz>

On Fri, Aug 08, 2014 at 12:46:19PM +0200, Jan Kara wrote:
> On Thu 07-08-14 07:13:05, Dave Chinner wrote:
> > On Wed, Aug 06, 2014 at 10:46:39AM +0200, Jan Kara wrote:
> > > On Wed 06-08-14 09:44:16, Dave Chinner wrote:
> > > > On Fri, Aug 01, 2014 at 12:00:53AM +0200, Jan Kara wrote:
> > > > > Switch inode dirty tracking lists to be per superblock instead of per
> > > > > bdi. This is a major step towards filesystems being able to do their
> > > > > own dirty tracking and selection of inodes for writeback if they desire
> > > > > so (e.g. because they journal or COW data and need to writeback inodes
> > > > > & pages in a specific order unknown to generic writeback code).
> > > > > 
> > > > > Per superblock dirty lists also make selecting inodes for writeback
> > > > > somewhat simpler because we don't have to search for inodes from a
> > > > > particular superblock for some kinds of writeback (OTOH we pay for this
> > > > > by having to iterate through superblocks for all-bdi type of writeback)
> > > > > and this simplification will allow for an easier switch to a better
> > > > > scaling data structure for dirty inodes.
> > ....
> > > > I'm not sure I really like this code very much - it seems to be
> > > > muchmore complex than it needs to be because writeback is still
> > > > managed on a per-bdi basis and the sb iteration is pretty clunky.
> > > > If we are moving to per-sb inode tracking, we should also move all
> > > > the writeback management to per-sb as well.
> > > > 
> > > > IMO, there's no good reason for keeping flusher threads per-bdi and
> > > > then having to iterate per-sb just to do background/periodic
> > > > writeback, and then have special cases for sb specific writeback
> > > > that avoids the bdi per-sb looping. i.e. per-sb flush work executed
> > > > by a bdi flusher thread makes a lot more sense than per-bdi
> > > > flush work that iterates superblocks.
> > > > 
> > > > So for the moment, I think this patch makes things worse rather than
> > > > better. I'd much prefer to see a single series that moves from per-bdi
> > > > tracking/writeback to per-sb tracking/writeback than to split the
> > > > tracking/writeback changes and then have to support an weird,
> > > > temporary, intermediate code base like this...
> > >   So when writing this series I was thinking about both possibilities -
> > > i.e., keeping per-bdi threads and changing to per-sb threads. In the end
> > > I've decided to start with keeping per-bdi threads and seeing how things
> > > work out.
> > 
> > > Regarding per-sb threads I have two unresolved questions:
> > > 
> > > 1) How to handle block device inodes - when filesystem is mounted on the
> > > block device, it would be natural to writeback such inode together with
> > > other filesystem's inodes. When filesystem isn't mounted on the block
> > > device we have no superblock to attach such inode to so we would have to
> > > have some flusher for the virtual blkdev superblock that will deal with
> > > specifics of block device superblock.
> > 
> > You mean for the blockdev_superblock? That's probably a good idea;
> > it would isolate the special case blockdev inode writeback in it's
> > own little function. i.e. this is crying out for a superblock
> > specific writeback method (just like tux3 is asking for) that simply
> > does:
> > 
> > 	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
> > 		if (work->sync == WB_SYNC_ALL)
> > 			__sync_blockdev(I_BDEV(inode), 1);
> > 		else
> > 			__sync_blockdev(I_BDEV(inode), 0);
> > 	}
>   Sadly, it's not that easy - you have to pass in struct writeback_control
> so that writing obeys nr_to_write - otherwise you can livelock doing
> writeback and preemption of background works won't work either. So you have
> to do something along the lines of what writeback_inodes() does...

Sure, the above is just an illustration of the concept, not a
complete solution. indeed, the only real work __sync_blockdev() ends
up doing is do_writepages/filemap_fdatawait. We don't really care
what the inode dirty flags are - if there's nothing dirty then
do_writepages won't issue IO, but filemap_fdatawait() will still
wait on any IO in progress. Hence a real loop looks more like
iterate_bdevs() and ends up looking somewhat like:

	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
		/* get inode references, drop list locks */

		if (bdi_wb_need_preempt(bdi, work))
			break;

		if (!bdi_wb_need_background(bdi, work))
			break;

		do_writepages(inode->i_mapping, wbc);
		if (wbc->sync == WB_SYNC_ALL)
			filemap_fdatawait(inode->i_mapping);

		/* drop references, lock lists */
	}

I can't see why it needs to be any more complex - syncing a blockdev
is actually really damn simple. That simplicity is actually hidden
by all the other stuff in the inode writeback path needed for
filesystem inodes. Again, this is just psuedo code, so treat it as
such....

IMO, the point of moving to per-sb writeback implementations is to
be able to make code that should be simple actually be simple.
Eveything is tangled up at the moment, so it's really hard to see
what should be simple and what actually needs to be complex.

> Also I don't really see how to get rid of specialcasing block devices.
> Normal superblocks are attached to a bdi - bdi flusher iterates all
> superblocks on that bdi to do background writeback of the bdi. However
> block device superblock is special - inodes of that superblock belong to
> different bdis so you cannot just attach that superblock to any particular
> bdi.

Then attach a psuedo BDI to the psuedo superblock list. We only need
a BDI to provide a flusher thread and scheduling of the flush
work....

> That's why I created a special "writeback queue" for each bdi embedded
> in struct backing_dev_info and use that for tracking writeback state of
> block device inodes belonging to the bdi.

AFAIC, that's a solution required by not abstracting the code in the
best manner.

> > For data integrity operations, the higher layers call
> > sync_blockdev() appropriately so we don't actually have to care
> > about blockdev inodes during per-sb writeback at all.
>   Agreed.
> 
> > With that, we get rid of another longstanding, ugly wart in the
> > writeback code.
> > 
> > > 2) How to deal with multiple superblocks per device - and here I'm
> > > convinced we should not regress writeback performance of the case where
> > > disk is partitioned into several partitions. And I think this will require
> > > some kind of synchronization between per-sb threads on the same device.
> > 
> > Right, that's why I suggested this:
> > 
> > > > i.e. per-sb flush work executed
> > > > by a bdi flusher thread makes a lot more sense than per-bdi
> > > > flush work that iterates superblocks.
>   Aha, sorry, I misunderstood this sentense.
> 
> > What I mean is that every work that is executed on a BDI has
> > work->sb set. Functions like wb_check_background_flush and
> > wb_check_background_flush no longer take just the bdi, they also
> > take the sb they are going to operate on so it canbe stuffed into
> > the work structure that is passed to bdi_writeback(). At this point,
> > bdi_writeback() only needs to operate on a single sb.
> > 
> > We still have all the same "periodic/background writeback is aborted
> > if new work is queued" logic so that data integrity operations
> > preempt background work, so there is no difference in behaviour
> > there.  i.e. the bdi still manages the interactions between all the
> > superblocks, just from a slightly different perspective: we
> > explicitly schedule writeback by superblock rather than by bdi.
>   OK, this is an interesting idea. However suppose you have two active
> filesystems on a bdi. Now you have to somehow switch between these two
> filesystems - I don't think you can switch to the second filesystem after
> you clean the first one because that may starve the second filesystem of
> background writeback for a long time and in the extreme case it may never
> happen. But I suppose if we set nr_to_write sensibly we could avoid such
> problems... I'll try this and see how it looks in the code.

Yeah, nr_to_write management should solve this - we already have a
bandwidth estimation for the device. making the per-sb writeback
slice be (BW / nr_sb_on_bdi) would give a roughly fair behaviour
by only allowing a single background writeback execution it's fair
share of the BW. If other SBs are idle, we immediately run the
current SB again with a new share....

FWIW, we already abort background writeback on a multi-sb BDI when a
work->sb work item comes in (e.g. from sync()/syncfs()). hence we
can already starve background/periodic writeback on certain sb's on
such a BDI simply by constantly dirtying a single sb and calling
syncfs() in a loop. Hence cross-sb background writeback starvation
is already an issue. I suspect that moving to per-sb scheduling will
allow us to note per-emption more easily, and if starvation is
occuring allow background writeback to ignore pre-emption for a
nr_to_write slice periodically...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2014-08-10 23:16 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-07-31 22:00 [RFC PATCH 00/14] Per-sb tracking of dirty inodes Jan Kara
2014-07-31 22:00 ` [PATCH 01/14] writeback: Get rid of superblock pinning Jan Kara
2014-07-31 22:00 ` [PATCH 02/14] writeback: Remove writeback_inodes_wb() Jan Kara
2014-07-31 22:00 ` [PATCH 03/14] writeback: Remove useless argument of writeback_single_inode() Jan Kara
2014-07-31 22:00 ` [PATCH 04/14] writeback: Don't put inodes which cannot be written to b_more_io Jan Kara
2014-07-31 22:00 ` [PATCH 05/14] writeback: Move dwork and last_old_flush into backing_dev_info Jan Kara
2014-07-31 22:00 ` [PATCH 06/14] writeback: Switch locking of bandwidth fields to wb_lock Jan Kara
2014-07-31 22:00 ` [PATCH 07/14] writeback: Provide a function to get bdi from bdi_writeback Jan Kara
2014-07-31 22:00 ` [PATCH 08/14] writeback: Schedule future writeback if bdi (not wb) has dirty inodes Jan Kara
2014-07-31 22:00 ` [PATCH 09/14] writeback: Switch some function arguments from bdi_writeback to bdi Jan Kara
2014-07-31 22:00 ` [PATCH 10/14] writeback: Move rechecking of work list into bdi_process_work_items() Jan Kara
2014-07-31 22:00 ` [PATCH 11/14] writeback: Shorten list_lock hold times in bdi_writeback() Jan Kara
2014-07-31 22:00 ` [PATCH 12/14] writeback: Move refill of b_io list into writeback_inodes() Jan Kara
2014-07-31 22:00 ` [PATCH 13/14] writeback: Comment update Jan Kara
2014-07-31 22:00 ` [PATCH 14/14] writeback: Per-sb dirty tracking Jan Kara
2014-08-01  5:14   ` Daniel Phillips
2014-08-05 23:44   ` Dave Chinner
2014-08-06  8:46     ` Jan Kara
2014-08-06 21:13       ` Dave Chinner
2014-08-08 10:46         ` Jan Kara
2014-08-10 23:16           ` Dave Chinner [this message]
2014-08-01  5:32 ` [RFC PATCH 00/14] Per-sb tracking of dirty inodes Daniel Phillips
2014-08-05  5:22 ` Dave Chinner
2014-08-05 10:31   ` Jan Kara
2014-08-05  8:20 ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140810231651.GO20518@dastard \
    --to=david@fromorbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=hirofumi@mail.parknet.co.jp \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox