[PATCH] writeback: plug writeback at a high level

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] writeback: plug writeback at a high level
@ 2013-06-15  2:50 Dave Chinner
  2013-06-17 14:34 ` Chris Mason
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2013-06-15  2:50 UTC (permalink / raw)
  To: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes imeediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small 4k files using fsmark
on XFS results in a huge number of IOPS being issued for data
writes.  Metadata writes are sorted and plugged at a high level by
XFS, so aggregate nicely into large IOs. However, data writeback IOs
are dispatched in individual 4k IOs, even when the blocks of two
consecutively written files are adjacent.

Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
metadata CRCs enabled.

Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)

Test:

$ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
/mnt/scratch/6  -d  /mnt/scratch/7

Result:

		wall	sys	create rate	Physical write IO
		time	CPU	(avg files/s)	 IOPS	Bandwidth
		-----	-----	------------	------	---------
unpatched	6m56s	15m47s	24,000+/-500	26,000	130MB/s
patched		5m06s	13m28s	32,800+/-600	 1,500	180MB/s
improvement	-26.44%	-14.68%	  +36.67%	-94.23%	+38.46%

If I use zero length files, this workload at about 500 IOPS, so
plugging drops the data IOs from roughly 25,500/s to 1000/s.
3 lines of code, 35% better throughput for 15% less CPU.

The benefits of plugging at this layer are likely to be higher for
spinning media as the IO patterns for this workload are going make a
much bigger difference on high IO latency devices.....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be5718..996f91a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -585,7 +585,9 @@ static long writeback_sb_inodes(struct super_block *sb,
 	unsigned long start_time = jiffies;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	struct blk_plug plug;

+	blk_start_plug(&plug);
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);

@@ -682,6 +684,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 				break;
 		}
 	}
+	blk_finish_plug(&plug);
 	return wrote;
 }

-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] writeback: plug writeback at a high level
  2013-06-15  2:50 [PATCH] writeback: plug writeback at a high level Dave Chinner
@ 2013-06-17 14:34 ` Chris Mason
  2013-06-18  1:58   ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Mason @ 2013-06-17 14:34 UTC (permalink / raw)
  To: Dave Chinner, linux-fsdevel@vger.kernel.org

Quoting Dave Chinner (2013-06-14 22:50:50)
> From: Dave Chinner <dchinner@redhat.com>
> 
> Doing writeback on lots of little files causes terrible IOPS storms
> because of the per-mapping writeback plugging we do. This
> essentially causes imeediate dispatch of IO for each mapping,
> regardless of the context in which writeback is occurring.
> 
> IOWs, running a concurrent write-lots-of-small 4k files using fsmark
> on XFS results in a huge number of IOPS being issued for data
> writes.  Metadata writes are sorted and plugged at a high level by
> XFS, so aggregate nicely into large IOs. However, data writeback IOs
> are dispatched in individual 4k IOs, even when the blocks of two
> consecutively written files are adjacent.
> 
> Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
> metadata CRCs enabled.
> 
> Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)

I'm a little worried about this one, just because of the impact on ssds
from plugging in the aio code:

https://lkml.org/lkml/2011/12/13/326

How exactly was your FS created?  I'll try it here.

-chris

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] writeback: plug writeback at a high level
  2013-06-17 14:34 ` Chris Mason
@ 2013-06-18  1:58   ` Dave Chinner
  2013-06-18 11:16     ` Chris Mason
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2013-06-18  1:58 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-fsdevel@vger.kernel.org

On Mon, Jun 17, 2013 at 10:34:57AM -0400, Chris Mason wrote:
> Quoting Dave Chinner (2013-06-14 22:50:50)
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Doing writeback on lots of little files causes terrible IOPS storms
> > because of the per-mapping writeback plugging we do. This
> > essentially causes imeediate dispatch of IO for each mapping,
> > regardless of the context in which writeback is occurring.
> > 
> > IOWs, running a concurrent write-lots-of-small 4k files using fsmark
> > on XFS results in a huge number of IOPS being issued for data
> > writes.  Metadata writes are sorted and plugged at a high level by
> > XFS, so aggregate nicely into large IOs. However, data writeback IOs
> > are dispatched in individual 4k IOs, even when the blocks of two
> > consecutively written files are adjacent.
> > 
> > Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
> > metadata CRCs enabled.
> > 
> > Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)
> 
> I'm a little worried about this one, just because of the impact on ssds
> from plugging in the aio code:

I'm testing on SSDs, but the impact has nothing to do with the
underlying storage.

> https://lkml.org/lkml/2011/12/13/326

Oh, that's completely different situation - it's application
submitted IO wherhe Io latency is a deterimining factor in
performance.

This is for background writeback, where IO latency is not a primary
performance issue - maximum throughput is what we are trying to
acheive here. For writeback, well formed IO has a much greater
impact on throughput than low latency IO submission, even for SSD
based storage.

> How exactly was your FS created?  I'll try it here.

The host has an XFS filesystem on a md RAID0 of 4x40Gb slices off
larger SSDs:

$ cat /proc/mdstat
Personalities : [raid0] 
md2 : active raid0 sdb1[0] sde1[3] sdd1[2] sdc1[1]
      167772032 blocks super 1.2 32k chunks

built with mkfs.xfs defaults. A sparse 100TB file is created and
then fed to KVM with cache=none,virtio.

The guest formats the 100TB device using default mkfs.xfs parameters
and uses default mount options, so it's a 100TB filesytsem with
about 150GB of real storage in it...

The underlying hardware controller that the SSDs are attached to is
limited to roughly 27,000 random 4k write IOPS, and that's the IO
pattern that writeback is splattering at the device.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] writeback: plug writeback at a high level
  2013-06-18  1:58   ` Dave Chinner
@ 2013-06-18 11:16     ` Chris Mason
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Mason @ 2013-06-18 11:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel@vger.kernel.org

Quoting Dave Chinner (2013-06-17 21:58:06)
> On Mon, Jun 17, 2013 at 10:34:57AM -0400, Chris Mason wrote:
> > Quoting Dave Chinner (2013-06-14 22:50:50)
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Doing writeback on lots of little files causes terrible IOPS storms
> > > because of the per-mapping writeback plugging we do. This
> > > essentially causes imeediate dispatch of IO for each mapping,
> > > regardless of the context in which writeback is occurring.
> > > 
> > > IOWs, running a concurrent write-lots-of-small 4k files using fsmark
> > > on XFS results in a huge number of IOPS being issued for data
> > > writes.  Metadata writes are sorted and plugged at a high level by
> > > XFS, so aggregate nicely into large IOs. However, data writeback IOs
> > > are dispatched in individual 4k IOs, even when the blocks of two
> > > consecutively written files are adjacent.
> > > 
> > > Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
> > > metadata CRCs enabled.
> > > 
> > > Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)
> > 
> > I'm a little worried about this one, just because of the impact on ssds
> > from plugging in the aio code:
> 
> I'm testing on SSDs, but the impact has nothing to do with the
> underlying storage.
> 
> > https://lkml.org/lkml/2011/12/13/326
> 
> Oh, that's completely different situation - it's application
> submitted IO wherhe Io latency is a deterimining factor in
> performance.
> 
> This is for background writeback, where IO latency is not a primary
> performance issue - maximum throughput is what we are trying to
> acheive here. For writeback, well formed IO has a much greater
> impact on throughput than low latency IO submission, even for SSD
> based storage.

Very true, but at the same time we do wait for background writeback
sometimes.  It's worth a quick test...

> 
> > How exactly was your FS created?  I'll try it here.
> 
> The host has an XFS filesystem on a md RAID0 of 4x40Gb slices off
> larger SSDs:
> 
> $ cat /proc/mdstat
> Personalities : [raid0] 
> md2 : active raid0 sdb1[0] sde1[3] sdd1[2] sdc1[1]
>       167772032 blocks super 1.2 32k chunks
> 
> built with mkfs.xfs defaults. A sparse 100TB file is created and
> then fed to KVM with cache=none,virtio.
> 
> The guest formats the 100TB device using default mkfs.xfs parameters
> and uses default mount options, so it's a 100TB filesytsem with
> about 150GB of real storage in it...
> 
> The underlying hardware controller that the SSDs are attached to is
> limited to roughly 27,000 random 4k write IOPS, and that's the IO
> pattern that writeback is splattering at the device.

Ok, I'll try something less exotic here ;)

-chris


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data
@ 2014-10-10 14:23 Jan Kara
  2014-10-10 14:23 ` [PATCH] writeback: plug writeback at a high level Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

  Hello,

  this is a second version of the patches to fix data corruption in mmapped
data when blocksize < pagesize as tested by xfstests generic/030 test.
The patchset fixes XFS and ext4. I've checked and btrfs doesn't need fixing
because it doesn't support blocksize < pagesize. If that's ever going
to change btrfs will likely need a similar treatment. ocfs2, ext2, ext3 are
OK since they happily allocate blocks during writeback. For other filesystems
like gfs2, ubifs, nilfs, ceph,... I'm not sure whether they support blocksize <
pagesize at all. Interesting is also NFS which may care but I don't understand
its ->page_mkwrite() handler good enough to judge.

Changes since v1:
- changed helper function name and moved it to mm/truncate.c - I originally
  thought we can make the helper function update i_size to simplify the
  interface but it's actually impossible due to generic_write_end() lock
  ordering constraints.
- used round_up() instead of ALIGN()
- taught truncate_setsize() to use the helper function

								Honza

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] writeback: plug writeback at a high level
  2014-10-10 14:23 [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data Jan Kara
@ 2014-10-10 14:23 ` Jan Kara
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Dave Chinner, Jan Kara, linux-ext4, Steven Whitehouse,
	ocfs2-devel, viro

From: Dave Chinner <dchinner@redhat.com>

tl;dr: 3 lines of code, 86% better fsmark thoughput consuming 13%
less CPU and 43% lower runtime.

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes imeediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small 4k files using fsmark
on XFS results in a huge number of IOPS being issued for data
writes.  Metadata writes are sorted and plugged at a high level by
XFS, so aggregate nicely into large IOs.

However, data writeback IOs are dispatched in individual 4k IOs -
even when the blocks of two consecutively written files are
adjacent - because the underlying block device is fast enough not to
congest on such IO. This behaviour is not SSD related - anything
with hardware caches is going to see the same benefits as the IO
rates are limited only by how fast adjacent IOs can be sent to the
hardware caches for aggregation.

Hence the speed of the physical device is irrelevant to this common
writeback workload (happens every time you untar a tarball!) -
performance is limited by the overhead of dispatching individual
IOs from a single writeback thread.

Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
metadata CRCs enabled.

Test:

$ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
/mnt/scratch/6  -d  /mnt/scratch/7

Result:
		wall	sys	create rate	Physical write IO
		time	CPU	(avg files/s)	 IOPS	Bandwidth
		-----	-----	-------------	------	---------
unpatched	5m54s	15m32s	32,500+/-2200	28,000	150MB/s
patched		3m19s	13m28s	52,900+/-1800	 1,500	280MB/s
improvement	-43.8%	-13.3%	  +62.7%	-94.6%	+86.6%

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 279292ba9403..d935fd3796ba 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -599,6 +599,9 @@ static long generic_writeback_inodes(struct wb_writeback_work *work)
 	unsigned long end_time = jiffies + HZ / 10;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);

 	spin_lock(&wb->list_lock);
 	while (1) {
@@ -688,6 +691,8 @@ static long generic_writeback_inodes(struct wb_writeback_work *work)
 out:
 	spin_unlock(&wb->list_lock);

+	blk_finish_plug(&plug);
+
 	return wrote;
 }

-- 
1.8.1.4

------------------------------------------------------------------------------
Meet PCI DSS 3.0 Compliance Requirements with EventLog Analyzer
Achieve PCI DSS 3.0 Compliant Status with Out-of-the-box PCI DSS Reports
Are you Audit-Ready for PCI DSS 3.0 Compliance? Download White paper
Comply to PCI DSS 3.0 Requirement 10 and 11.5 with EventLog Analyzer
http://pubads.g.doubleclick.net/gampad/clk?id=154622311&iu=/4140/ostg.clktrk

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2014-10-10 14:23 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-06-15  2:50 [PATCH] writeback: plug writeback at a high level Dave Chinner
2013-06-17 14:34 ` Chris Mason
2013-06-18  1:58   ` Dave Chinner
2013-06-18 11:16     ` Chris Mason
  -- strict thread matches above, loose matches on Subject: below --
2014-10-10 14:23 [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data Jan Kara
2014-10-10 14:23 ` [PATCH] writeback: plug writeback at a high level Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).