From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org,
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>,
Wu Fengguang <fengguang.wu@intel.com>
Subject: Re: [RFC PATCH 00/14] Per-sb tracking of dirty inodes
Date: Tue, 5 Aug 2014 15:22:17 +1000 [thread overview]
Message-ID: <20140805052217.GD20518@dastard> (raw)
In-Reply-To: <1406844053-25982-1-git-send-email-jack@suse.cz>
On Fri, Aug 01, 2014 at 12:00:39AM +0200, Jan Kara wrote:
> Hello,
>
> here is my attempt to implement per superblock tracking of dirty inodes.
> I have two motivations for this:
> 1) I've tried to get rid of overwriting of inode's dirty time stamp during
> writeback and filtering of dirty inodes by superblock makes this
> significantly harder. For similar reasons also improving scalability
> of inode dirty tracking is more complicated than it has to be.
> 2) Filesystems like Tux3 (but to some extent also XFS) would like to
> influence order in which inodes are written back. Currently this isn't
> possible. Tracking dirty inodes per superblock makes it easy to later
> implement filesystem callback for writing back inodes and also possibly
> allow filesystems to implement their own dirty tracking if they desire.
>
> The patches pass xfstests run and also some sync livelock avoidance tests
> I have with 4 filesystems on 2 disks so they should be reasonably sound.
> Before I go and base more work on this I'd like to hear some feedback about
> whether people find this sane and workable.
>
> After this patch set it is trivial to provide a per-sb callback for writeback
> (at level of writeback_inodes()). It is also fairly easy to allow filesystem to
> completely override dirty tracking (only needs some restructuring of
> mark_inode_dirty()). I can write these as a proof-of-concept patches for Tux3
> guys once the general approach in this patch set is acked. Or if there are
> some in-tree users (XFS?, btrfs?) I can include them in the patch set.
>
> Any comments welcome!
My initial performance tests haven't shown any regressions, but
those same tests show that we still need to add plugging to
writeback_inodes(). Patch with numbers below. I haven't done any
sanity testing yet - I'll do that over the next few days...
FWIW, the patch set doesn't solve the sync lock contention problems -
populate all of memory with millions of inodes on a mounted
filesystem, then run xfs/297 on a different filesystem. The system
will trigger major contention in sync_inodes_sb() and
inode_sb_list_add() on the inode_sb_list_lock because xfs/297 will
cause lots of concurrent sync() calls to occur. The system will
perform really badly on anything filesystem related while this
contention occurs. Normally xfs/297 runs in 36s on the machine I
just ran this test on, with the extra cached inodes it's been
running for 15 minutes burning 8-9 CPU cores and there's no end in
sight....
I guess I should dig out my patchset to fix that and port it on top
of this one....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
writeback: plug writeback at a high level
From: Dave Chinner <dchinner@redhat.com>
tl;dr: 3 lines of code, 86% better fsmark throughput, 13% less CPU
and 43% lower runtime.
Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes immediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.
IOWs, running a concurrent write-lots-of-small-4k-files workload
using fsmark on XFS results in a huge number of IOPS being issued for
data
writes. Metadata writes are sorted and plugged at a high level by
XFS, so aggregate nicely into large IOs.
However, data writeback IOs are dispatched in individual 4k IOs -
even when the blocks of two consecutively written files are
adjacent - because the underlying block device is fast enough not to
congest on such IO. This behaviour is not SSD related - anything
with hardware caches is going to see the same benefits as the IO
rates are limited only by how fast adjacent IOs can be sent to the
hardware caches for aggregation.
Hence the speed of the physical device is irrelevant to this common
writeback workload (happens every time you untar a tarball!) -
performance is limited by the overhead of dispatching individual
IOs from a single writeback thread.
Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
metadata CRCs enabled.
Test:
$ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 \
        -d /mnt/scratch/0 -d /mnt/scratch/1 \
        -d /mnt/scratch/2 -d /mnt/scratch/3 \
        -d /mnt/scratch/4 -d /mnt/scratch/5 \
        -d /mnt/scratch/6 -d /mnt/scratch/7
Result:
                wall      sys      create rate      Physical write IO
                time      CPU      (avg files/s)    IOPS     Bandwidth
                -----     ------   -------------    ------   ---------
unpatched       5m54s     15m32s   32,500+/-2200    28,000   150MB/s
patched         3m19s     13m28s   52,900+/-1800     1,500   280MB/s
improvement    -43.8%    -13.3%    +62.7%           -94.6%   +86.6%
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e80d1b9..2e80e80 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -592,6 +592,9 @@ static long writeback_inodes(struct bdi_writeback *wb,
 	unsigned long start_time = jiffies;
 	long write_chunk;
 	long wrote = 0;	/* count both pages and inodes */
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
 
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
@@ -681,6 +684,8 @@ static long writeback_inodes(struct bdi_writeback *wb,
 		wb->state |= WB_STATE_STALLED;
 	spin_unlock(&wb->list_lock);
 
+	blk_finish_plug(&plug);
+
 	return wrote;
 }