* [PATCH] xfs: improve sync behaviour in face of aggressive dirtying
@ 2011-06-17 13:14 Christoph Hellwig
2011-06-20 8:18 ` Christoph Hellwig
0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2011-06-17 13:14 UTC (permalink / raw)
To: xfs; +Cc: Wu Fengguang
The following script from Wu Fengguang shows very bad behaviour in XFS
when data is aggressively dirtied during a sync, with sync times up to
almost 10 times as long as on ext4.
A large part of the issue is that XFS writes data out itself two times
in the ->sync_fs method, overriding the livelock protection in the core
writeback code, and another issue is the lock-less xfs_ioend_wait call,
which doesn't prevent new ioends from being queued up while waiting for
the count to reach zero.
This patch removes the XFS-internal sync calls and relies on the VFS
to do its work just like all other filesystems do, and just uses the
internal inode iterator to wait for pending ioends, but now with the
iolock held. With this fix we improve the sync time quite a bit,
but we'll need further work to split i_iocount between pending direct
I/O requests, which are only relevant for truncate, and pending
unwritten extent conversions, which matter for sync and fsync.
------------------------------ snip ------------------------------
#!/bin/sh
umount /dev/sda7
mkfs.xfs -f /dev/sda7
# mkfs.ext4 /dev/sda7
# mkfs.btrfs /dev/sda7
mount /dev/sda7 /fs
echo $((50<<20)) > /proc/sys/vm/dirty_bytes
pid=
for i in `seq 10`
do
dd if=/dev/zero of=/fs/zero-$i bs=1M count=1000 &
pid="$pid $!"
done
sleep 1
tic=$(date +'%s')
sync
tac=$(date +'%s')
echo
echo sync time: $((tac-tic))
egrep '(Dirty|Writeback|NFS_Unstable)' /proc/meminfo
pidof dd > /dev/null && { kill -9 $pid; echo sync NOT livelocked; }
------------------------------ snip ------------------------------
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Index: xfs/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_sync.c 2011-06-17 14:16:18.442399481 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_sync.c 2011-06-17 14:18:06.632394003 +0200
@@ -215,6 +215,19 @@ xfs_inode_ag_iterator(
}
STATIC int
+xfs_wait_ioend_cb(
+ struct xfs_inode *ip,
+ struct xfs_perag *pag,
+ int flags)
+{
+ xfs_ilock(ip, XFS_IOLOCK_SHARED);
+ xfs_ioend_wait(ip);
+ xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+
+ return 0;
+}
+
+STATIC int
xfs_sync_inode_data(
struct xfs_inode *ip,
struct xfs_perag *pag,
@@ -359,14 +372,15 @@ xfs_quiesce_data(
{
int error, error2 = 0;
- /* push non-blocking */
- xfs_sync_data(mp, 0);
xfs_qm_sync(mp, SYNC_TRYLOCK);
-
- /* push and block till complete */
- xfs_sync_data(mp, SYNC_WAIT);
xfs_qm_sync(mp, SYNC_WAIT);
+ /* wait for all pending unwritten extent conversions */
+ xfs_inode_ag_iterator(mp, xfs_wait_ioend_cb, 0);
+
+ /* force out the newly dirtied log buffers */
+ xfs_log_force(mp, XFS_LOG_SYNC);
+
/* write superblock and hoover up shutdown errors */
error = xfs_sync_fsdata(mp);
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: improve sync behaviour in face of aggressive dirtying
  2011-06-17 13:14 [PATCH] xfs: improve sync behaviour in face of aggressive dirtying Christoph Hellwig
@ 2011-06-20  8:18 ` Christoph Hellwig
  2011-06-21  0:33   ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2011-06-20 8:18 UTC (permalink / raw)
To: xfs; +Cc: Wu Fengguang

As confirmed by Wu, moving to a workqueue flush as in the test patch
below brings our performance on par with other filesystems. But there
are a major and a minor issue with that.

The minor one is that we always flush all work items and not just those
on the filesystem to be flushed. This might become an issue for larger
systems, or when we apply a similar scheme to fsync, which has the same
underlying issue.

The major one is that flush_workqueue only flushes work items that were
queued before it was called, but we can requeue completions when we fail
to get the ilock in xfs_setfilesize, which can lead to losing i_size
updates when it happens.

I see two ways to fix this: either we implement our own workqueue
look-alike based on the old workqueue code. This would allow flushing
queues per-sb or even per-inode, and allow us to special-case flushing
requeues as well before returning. Or we copy the scheme ext4 uses for
fsync (it completely fails to flush the completion queue for plain
sync), that is, add a list of pending items to the inode, and a lock to
protect it. I don't like this because it a) bloats the inode, b) adds a
lot of complexity, and c) adds another lock to the irq I/O completion
path.

Any other ideas?
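For reference, the requeue that escapes flush_workqueue happens in the
I/O completion work item, roughly like the following sketch. This is a
simplified illustration, not the exact kernel source; the function
bodies are abbreviated:

```c
/*
 * Sketch only (abbreviated, not the exact source): when
 * xfs_setfilesize() cannot take the ilock it returns EAGAIN and the
 * completion is queued again.  A concurrent flush_workqueue() only
 * waits for items queued before it was called, so the requeued
 * completion - and the i_size update it carries - can be missed by
 * sync.
 */
STATIC void
xfs_end_io(
	struct work_struct	*work)
{
	xfs_ioend_t		*ioend =
		container_of(work, xfs_ioend_t, io_work);

	if (xfs_setfilesize(ioend) == EAGAIN) {
		/* ilock contended: requeue and retry later */
		queue_work(xfsdatad_workqueue, &ioend->io_work);
		return;		/* escapes an in-progress flush_workqueue() */
	}
	xfs_destroy_ioend(ioend);
}
```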
Index: xfs/fs/xfs/linux-2.6/xfs_sync.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_sync.c	2011-06-17 14:16:18.442399481 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_sync.c	2011-06-18 17:55:44.864025123 +0200
@@ -359,14 +359,16 @@ xfs_quiesce_data(
 {
 	int error, error2 = 0;
 
-	/* push non-blocking */
-	xfs_sync_data(mp, 0);
 	xfs_qm_sync(mp, SYNC_TRYLOCK);
-
-	/* push and block till complete */
-	xfs_sync_data(mp, SYNC_WAIT);
 	xfs_qm_sync(mp, SYNC_WAIT);
 
+	/* flush all pending size updates and unwritten extent conversions */
+	flush_workqueue(xfsconvertd_workqueue);
+	flush_workqueue(xfsdatad_workqueue);
+
+	/* force out the newly dirtied log buffers */
+	xfs_log_force(mp, XFS_LOG_SYNC);
+
 	/* write superblock and hoover up shutdown errors */
 	error = xfs_sync_fsdata(mp);

Index: xfs/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_super.c	2011-06-18 17:51:05.660705925 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_super.c	2011-06-18 17:52:50.107367305 +0200
@@ -929,45 +929,12 @@ xfs_fs_write_inode(
 		 * ->sync_fs call do that for thus, which reduces the number
 		 * of synchronous log foces dramatically.
 		 */
-		xfs_ioend_wait(ip);
 		xfs_ilock(ip, XFS_ILOCK_SHARED);
-		if (ip->i_update_core) {
+		if (ip->i_update_core)
 			error = xfs_log_inode(ip);
-			if (error)
-				goto out_unlock;
-		}
-	} else {
-		/*
-		 * We make this non-blocking if the inode is contended, return
-		 * EAGAIN to indicate to the caller that they did not succeed.
-		 * This prevents the flush path from blocking on inodes inside
-		 * another operation right now, they get caught later by
-		 * xfs_sync.
-		 */
-		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
-			goto out;
-
-		if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
-			goto out_unlock;
-
-		/*
-		 * Now we have the flush lock and the inode is not pinned, we
-		 * can check if the inode is really clean as we know that
-		 * there are no pending transaction completions, it is not
-		 * waiting on the delayed write queue and there is no IO in
-		 * progress.
-		 */
-		if (xfs_inode_clean(ip)) {
-			xfs_ifunlock(ip);
-			error = 0;
-			goto out_unlock;
-		}
-		error = xfs_iflush(ip, SYNC_TRYLOCK);
+		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 	}
 
- out_unlock:
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
- out:
 	/*
 	 * if we failed to write out the inode then mark
 	 * it dirty again so we'll try again later.
* Re: [PATCH] xfs: improve sync behaviour in face of aggressive dirtying
  2011-06-20  8:18 ` Christoph Hellwig
@ 2011-06-21  0:33   ` Dave Chinner
  2011-06-21  7:21     ` Michael Monnerie
  2011-06-21  9:29     ` Christoph Hellwig
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2011-06-21 0:33 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Wu Fengguang, xfs

On Mon, Jun 20, 2011 at 04:18:02AM -0400, Christoph Hellwig wrote:
> As confirmed by Wu moving to a workqueue flush as in the test patch
> below brings our performance on par with other filesystems. But there's
> a major and a minor issues with that.

I like this much better than the previous inode iterator version.

> The minor one is that we always flush all work items and not just those
> on the filesystem to be flushed. This might become an issue for lager
> systems, or when we apply a similar scheme to fsync, which has the same
> underlying issue.

For sync, I don't think it matters if we flush a few extra IO
completions on a busy system. For fsync, it could increase fsync
completion latency significantly, though I'd suggest we should
address that problem when we apply the scheme to fsync.

> The major one is that flush_workqueue only flushed work items that were
> queued before it was called, but we can requeue completions when we fail
> to get the ilock in xfs_setfilesize, which can lead to losing i_size
> updates when it happens.

Yes, I can see that will cause problems....

> I see two ways to fix this: either we implement our own workqueue
> look-alike based on the old workqueue code. This would allow flushing
> queues per-sb or even per-inode, and allow us to special case flushing
> requeues as well before returning.

No need for a look-alike. With the CMWQ infrastructure, there is no
reason why we need global workqueues anymore. The log, data and
convert wqs were global to minimise the number of per-cpu threads
XFS required to operate. CMWQ prevents the explosion of mostly idle
kernel threads, so we could move all these workqueues to per-struct
xfs_mount without undue impact.

We already have buftarg->xfs_mount and ioend->xfs_mount backpointers,
so it would be trivial to do this conversion from the queueing/flushing
POV. That immediately reduces the scope of the flushes necessary to
sync a filesystem.

I don't think we want to go to per-inode work contexts. One
possibility is that for fsync related writeback (e.g. WB_SYNC_ALL)
we could have a separate "fsync-completion" wqs that we queue
completions to rather than the more widely used data workqueue. Then
for fsync we'd only need to flush the fsync-completion workqueue
rather than the mount-wide data and convert wqs, and hence we
wouldn't stall on IO completion for IO outside of fsync scope...

> Or we copy the scheme ext4 uses for fsync (it completely fails to flush
> the completion queue for plain sync), that is add a list of pending
> items to the inode, and a lock to protect it. I don't like this because
> it a) bloats the inode, b) adds a lot of complexity, and c) another lock
> to the irq I/O completion.

I'm not a great fan of that method, either. Using a separate wq
channel(s) for fsync completions seems like a much cleaner solution
to me...

Cheers,

Dave.
>
>
> Index: xfs/fs/xfs/linux-2.6/xfs_sync.c
> ===================================================================

[quoted xfs_sync.c hunk snipped - identical to the patch above]

> Index: xfs/fs/xfs/linux-2.6/xfs_super.c
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/xfs_super.c	2011-06-18 17:51:05.660705925 +0200
> +++ xfs/fs/xfs/linux-2.6/xfs_super.c	2011-06-18 17:52:50.107367305 +0200
> @@ -929,45 +929,12 @@ xfs_fs_write_inode(
>  		 * ->sync_fs call do that for thus, which reduces the number
>  		 * of synchronous log foces dramatically.
>  		 */
> -		xfs_ioend_wait(ip);
>  		xfs_ilock(ip, XFS_ILOCK_SHARED);
> -		if (ip->i_update_core) {
> +		if (ip->i_update_core)
>  			error = xfs_log_inode(ip);
> -			if (error)
> -				goto out_unlock;
> -		}
> -	} else {
> -		/*
> -		 * We make this non-blocking if the inode is contended, return
> -		 * EAGAIN to indicate to the caller that they did not succeed.
> -		 * This prevents the flush path from blocking on inodes inside
> -		 * another operation right now, they get caught later by
> -		 * xfs_sync.
> -		 */
> -		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
> -			goto out;
> -
> -		if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
> -			goto out_unlock;
> -
> -		/*
> -		 * Now we have the flush lock and the inode is not pinned, we
> -		 * can check if the inode is really clean as we know that
> -		 * there are no pending transaction completions, it is not
> -		 * waiting on the delayed write queue and there is no IO in
> -		 * progress.
> -		 */
> -		if (xfs_inode_clean(ip)) {
> -			xfs_ifunlock(ip);
> -			error = 0;
> -			goto out_unlock;
> -		}
> -		error = xfs_iflush(ip, SYNC_TRYLOCK);
> +		xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  	}
>  
> - out_unlock:
> -	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> - out:
>  	/*
>  	 * if we failed to write out the inode then mark
>  	 * it dirty again so we'll try again later.

Hmmm - I'm wondering if there is any performance implication of
removing the background inode writeback flush here. I expect it will
change inode flush patterns, and they are pretty good right now. I
think we need to check if most inode writeback is coming from the AIL
(due to log tail pushing) or from this function before making this
change....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com
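As a concrete illustration of the per-mount workqueue conversion Dave
suggests, the setup might look roughly like the sketch below. The field
names (m_data_workqueue, m_unwritten_workqueue) and the exact flags are
invented for this example and are not taken from an actual patch:

```c
/*
 * Illustrative sketch only: allocate the completion workqueues per
 * struct xfs_mount instead of using the global xfsdatad/xfsconvertd
 * queues.  All names here are assumptions for illustration.
 */
STATIC int
xfs_init_mount_workqueues(
	struct xfs_mount	*mp)
{
	mp->m_data_workqueue = alloc_workqueue("xfs-data/%s",
			WQ_MEM_RECLAIM, 0, mp->m_fsname);
	if (!mp->m_data_workqueue)
		return ENOMEM;

	mp->m_unwritten_workqueue = alloc_workqueue("xfs-conv/%s",
			WQ_MEM_RECLAIM, 0, mp->m_fsname);
	if (!mp->m_unwritten_workqueue) {
		destroy_workqueue(mp->m_data_workqueue);
		return ENOMEM;
	}
	return 0;
}
```

With something like this in place, xfs_quiesce_data could flush only
mp->m_unwritten_workqueue and mp->m_data_workqueue, so a sync on one
filesystem no longer waits on completions belonging to other mounts.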
* Re: [PATCH] xfs: improve sync behaviour in face of aggressive dirtying
  2011-06-21  0:33 ` Dave Chinner
@ 2011-06-21  7:21   ` Michael Monnerie
  2011-06-22  0:19     ` Dave Chinner
  1 sibling, 1 reply; 8+ messages in thread
From: Michael Monnerie @ 2011-06-21 7:21 UTC (permalink / raw)
To: xfs; +Cc: Christoph Hellwig, Wu Fengguang

On Tuesday, 21 June 2011 Dave Chinner wrote:
> > The minor one is that we always flush all work items and not just
> > those on the filesystem to be flushed. This might become an issue
> > for lager systems, or when we apply a similar scheme to fsync,
> > which has the same underlying issue.
>
> For sync, I don't think it matters if we flush a few extra IO
> completions on a busy system.

Couldn't that be bad on a system with mixed fast/slow storage (say 15k
SAS and 7.2k SATA), where lots of syncs occur on the busy fast SAS
disks and lead to extra I/O on the SATA disks? Especially if there are
16 SAS disks in a RAID-0 array against 4 SATA disks in RAID-6, to take
the worst case. If the SATA disks are already heavily used (say >=50%),
those extra writes could bring them to their knees.

I'm not sure how often syncs occur though; maybe that's why Dave says
it shouldn't matter? AFAIK, databases generate heavy syncs though.

-- 
With kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [pronounced: Prot-e-schee]
Tel: +43 660 / 415 6531 // House for sale: http://zmi.at/langegg/
* Re: [PATCH] xfs: improve sync behaviour in face of aggressive dirtying
  2011-06-21  7:21 ` Michael Monnerie
@ 2011-06-22  0:19   ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2011-06-22 0:19 UTC (permalink / raw)
To: Michael Monnerie; +Cc: Christoph Hellwig, Wu Fengguang, xfs

On Tue, Jun 21, 2011 at 09:21:46AM +0200, Michael Monnerie wrote:
> On Dienstag, 21. Juni 2011 Dave Chinner wrote:
> > > The minor one is that we always flush all work items and not just
> > > those on the filesystem to be flushed. This might become an issue
> > > for lager systems, or when we apply a similar scheme to fsync,
> > > which has the same underlying issue.
> >
> > For sync, I don't think it matters if we flush a few extra IO
> > completions on a busy system.
>
> Couldn't that be bad on a system with mixed fast/slow storage (say 15k
> SAS and 7.2k SATA), where on the busy fast SAS lots of syncs occur and
> lead to extra I/O on the SATA disks? Especially if there are 16 SAS
> disks in an array with RAID-0 against 4 SATA disks in RAID-6, to say
> the worst. If the SATAs are already heavy used (say >=50%), those
> extra writes could bring them to their knees.

We are not talking about issuing extra writes to disk, so you don't
have to worry about this. What we are talking about is how to
efficiently flush the XFS IO completion queues for writes that the
hardware has already completed. That's almost always just CPU overhead
and doesn't involve more IO.... ;)

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH] xfs: improve sync behaviour in face of aggressive dirtying
  2011-06-21  0:33 ` Dave Chinner
@ 2011-06-21  9:29   ` Christoph Hellwig
  2011-06-22  1:09     ` Dave Chinner
  1 sibling, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2011-06-21 9:29 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Wu Fengguang, xfs

On Tue, Jun 21, 2011 at 10:33:43AM +1000, Dave Chinner wrote:
> > The minor one is that we always flush all work items and not just those
> > on the filesystem to be flushed. This might become an issue for lager
> > systems, or when we apply a similar scheme to fsync, which has the same
> > underlying issue.
>
> For sync, I don't think it matters if we flush a few extra IO
> completions on a busy system. For fsync, it could increase fsync
> completion latency significantly, though I'd suggest we should
> address that problem when we apply the scheme to fsync.
>
> > The major one is that flush_workqueue only flushed work items that were
> > queued before it was called, but we can requeue completions when we fail
> > to get the ilock in xfs_setfilesize, which can lead to losing i_size
> > updates when it happens.
>
> Yes, I can see that will cause problems....
>
> > I see two ways to fix this: either we implement our own workqueue
> > look-alike based on the old workqueue code. This would allow flushing
> > queues per-sb or even per-inode, and allow us to special case flushing
> > requeues as well before returning.
>
> No need for a look-alike. With the CMWQ infrastructure, there is no
> reason why we need global workqueues anymore. The log, data and
> convert wqs were global to minimise the number of per-cpu threads
> XFS required to operate. CMWQ prevents the explosion of mostly idle
> kernel threads, so we could move all these workqueues to per-struct
> xfs_mount without undue impact.

That helps to work around the impact of syncs on other filesystems, but
it does not help with the EAGAIN requeueing.

> I don't think we want to go to per-inode work contexts. One
> possibility is that for fsync related writeback (e.g. WB_SYNC_ALL)

WB_SYNC_ALL is also used for the second pass of sync.

> we could have a separate "fsync-completion" wqs that we queue
> completions to rather than the more widely used data workqueue. Then
> for fsync we'd only need to flush the fsync-completion workqueue
> rather than the mount wide data and convert wqs, and hence we
> wouldn't stall on IO completion for IO outside of fsync scope...

One thing I've thought about is abusing current->journal_info to
complete ioends in-process for fsync. That will need Josef's patch
to move the filemap_write_and_wait call into ->fsync. At that point
we can do:

xfs_fsync() {

	current->journal_info = &ioend_end_list;

	filemap_write_and_wait();

	list_for_each_entry_reverse(ioend_end_list) {
		process_ioend();
	}

	current->journal_info = NULL;

which means there's no context switch involved, and as an added benefit
processing the list in reverse means we minimize the number of i_size
updates, which will be especially interesting with my changes to always
log these directly and get rid of the unlogged metadata changes.

It does not cover ioends already pending before we called fsync, but
I hope we can live with that.

What we still need to sort out with the workqueue scheme is a way to
deal with the EAGAIN returns from xfs_setfilesize. From reading your
changelog for that change it seems we can simply kill it once we move
to per-mount completion queues, as the loop device would have its own
workqueue. If that sounds reasonable I'll respin a series to move to
per-mount workqueues, remove the EAGAIN case, and use the workqueue
flush in sync. Fsync will be left for later, and I'll ping Josef to
resend his fsync prototype change.
* Re: [PATCH] xfs: improve sync behaviour in face of aggressive dirtying
  2011-06-21  9:29 ` Christoph Hellwig
@ 2011-06-22  1:09   ` Dave Chinner
  2011-06-22  6:51     ` Christoph Hellwig
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2011-06-22 1:09 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Wu Fengguang, xfs

On Tue, Jun 21, 2011 at 05:29:20AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 21, 2011 at 10:33:43AM +1000, Dave Chinner wrote:
> > > The minor one is that we always flush all work items and not just those
> > > on the filesystem to be flushed. This might become an issue for lager
> > > systems, or when we apply a similar scheme to fsync, which has the same
> > > underlying issue.
> >
> > For sync, I don't think it matters if we flush a few extra IO
> > completions on a busy system. For fsync, it could increase fsync
> > completion latency significantly, though I'd suggest we should
> > address that problem when we apply the scheme to fsync.
> >
> > > The major one is that flush_workqueue only flushed work items that were
> > > queued before it was called, but we can requeue completions when we fail
> > > to get the ilock in xfs_setfilesize, which can lead to losing i_size
> > > updates when it happens.
> >
> > Yes, I can see that will cause problems....
> >
> > > I see two ways to fix this: either we implement our own workqueue
> > > look-alike based on the old workqueue code. This would allow flushing
> > > queues per-sb or even per-inode, and allow us to special case flushing
> > > requeues as well before returning.
> >
> > No need for a look-alike. With the CMWQ infrastructure, there is no
> > reason why we need global workqueues anymore. The log, data and
> > convert wqs were global to minimise the number of per-cpu threads
> > XFS required to operate. CMWQ prevents the explosion of mostly idle
> > kernel threads, so we could move all these workqueues to per-struct
> > xfs_mount without undue impact.
>
> That helps to work around the impact of syncs on other filesystems, but
> it does not help with the EAGAIN requeueing.

Right. I wasn't suggesting that it would. I was just talking out loud
about why we have the current structure and how the reasons for it
existing weren't relevant anymore. Hence changing the workqueue
context is something we should be able to do with little impact....

> > I don't think we want to go to per-inode work contexts. One
> > possibility is that for fsync related writeback (e.g. WB_SYNC_ALL)
>
> WB_SYNC_ALL is also used for the second pass of sync.

Sure, but that is effectively the same case as fsync, as most of the
dirty pages would already be written by the WB_SYNC_NONE pass.
Regardless, all an extra fsync-completion queue would mean is that
sync would also need to flush the fsync-completion queue.....

> > we could have a separate "fsync-completion" wqs that we queue
> > completions to rather than the more widely used data workqueue. Then
> > for fsync we'd only need to flush the fsync-completion workqueue
> > rather than the mount wide data and convert wqs, and hence we
> > wouldn't stall on IO completion for IO outside of fsync scope...
>
> One thing I've thought about is abusing current->journal_info to
> complete ioends in-process for fsync.

Oh, now that is devious. I've never considered abusing that field in
the task context before ;)

> That will need Josef's patch
> to move the filemap_write_and_wait call into ->fsync. At that point
> we can do:
>
> xfs_fsync() {
>
> 	current->journal_info = &ioend_end_list;
>
> 	filemap_write_and_wait();
>
> 	list_for_each_entry_reverse(ioend_end_list) {
> 		process_ioend();
> 	}
>
> 	current->journal_info = NULL;
>
> which means there's no context switch involved, and as an added benefit
> processing the list in reverse means we minimize the number of i_size
> updates, which will be especially interesting with my changes to always
> log these directly and get rid of the unlogged metadata changes.

All good, except I think there's a small problem with this - we have
to process the ioends before pages will transition from WRITEBACK to
clean. i.e. it is not until xfs_ioend_destroy() that we call the
bh->b_end_io() function to update the page state. Hence it would
have to be:

xfs_fsync() {

	current->journal_info = &ioend_end_list;

	filemap_fdatawrite();

	list_for_each_entry_reverse(ioend_end_list) {
		/* process_ioend also waits for ioend completion */
		process_ioend();
	}

	current->journal_info = NULL;

	filemap_fdatawait();
}

> It does not cover ioends already pending before we called fsync, but
> I hope we can live with that.

That is not a problem - those pages will be marked WRITEBACK so the
filemap_fdatawait() will catch them. All that will happen is we'll
take a latency hit when there is a concurrent cleaner running.

Direct IO is another matter, but we've already got an
xfs_ioend_wait() in xfs_fsync() to deal with that. Perhaps that
could be moved over to your new DIO counter so we do block on all
pending IO?

> What we still need to sort out with the workqueue scheme is a way to
> deal with the EAGAIN returns from xfs_setfilesize. From reading your
> changelog for that change it seems we can simply kill it once we move
> to per-mount completion queues, as the loop device would have its own
> workqueue.

Yes, that was the primary issue it was intended to solve. Per-mount
workqueues were rejected at the time because of the thread explosion
they would cause. That's not a problem now ;)

I'm just trying to remember if there were other problems that the
requeue also solved. I think it was to do with blocking IO completions
when memory is low (i.e. something holding the ilock but blocking on
memory allocation), but that, once again, is solved by CMWQ allowing
multiple outstanding work items to be in progress concurrently.

> If that sounds reasonable I'll respin a series to move to
> per-mount workqueues, remove the EAGAIN case, and use the workqueue
> flush in sync. Fsync will be left for later, and I'll ping Josef to
> resend his fsync prototype change.

Yes, sounds like a plan.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH] xfs: improve sync behaviour in face of aggressive dirtying
  2011-06-22  1:09 ` Dave Chinner
@ 2011-06-22  6:51   ` Christoph Hellwig
  0 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2011-06-22 6:51 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, Wu Fengguang, xfs

On Wed, Jun 22, 2011 at 11:09:11AM +1000, Dave Chinner wrote:
> All good, except I think there's a small problem with this - we have
> to process the ioends before pages will transition from WRITEBACK to
> clean. i.e. it is not until xfs_ioend_destroy() that we call the
> bh->b_end_io() function to update the page state. Hence it would
> have to be:
>
> xfs_fsync() {
>
> 	current->journal_info = &ioend_end_list;
>
> 	filemap_fdatawrite();
>
> 	list_for_each_entry_reverse(ioend_end_list) {
> 		/* process_ioend also waits for ioend completion */
> 		process_ioend();
> 	}
>
> 	current->journal_info = NULL;
>
> 	filemap_fdatawait();

Indeed.

> Direct IO is another matter, but we've already got an
> xfs_ioend_wait() in xfs_fsync() to deal with that. Perhaps that
> could be moved over to your new DIO counter so we do block on all
> pending IO?

Splitting the pending direct I/O requests out into the new counter is
indeed the plan. We'll still need to track ioends for them, though -
and I haven't thought about the details for those yet.

> > If that sounds reasonable I'll respin a series to move to
> > per-mount workqueues, remove the EAGAIN case, and use the workqueue
> > flush in sync. Fsync will be left for later, and I'll ping Josef to
> > resend his fsync prototype change.
>
> Yes, sounds like a plan.

I implemented it yesterday, and it appears to work fine. But there's
another issue I found: the flush_workqueue will update i_size and mark
the inodes dirty right now from ->sync_fs, but that's after we've done
the VFS writeback. I guess I need to order this patch after the one
I'm working on to stop doing non-transactional inode updates.