* [PATCH 1/6] xfs: introduce inode cluster buffer trylocks for xfs_iflush
2011-03-10 0:05 [PATCH 0/6] xfs: Reduce OOM kill problems under heavy load V2 Dave Chinner
@ 2011-03-10 0:05 ` Dave Chinner
2011-03-10 17:31 ` Christoph Hellwig
2011-03-10 0:05 ` [PATCH 2/6] xfs: introduce a xfssyncd workqueue Dave Chinner
` (4 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2011-03-10 0:05 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
There is an ABBA deadlock between synchronous inode flushing in
xfs_reclaim_inode and xfs_ifree_cluster. xfs_ifree_cluster locks the
buffer, then takes inode ilocks, whilst synchronous reclaim takes
the ilock followed by the buffer lock in xfs_iflush().
To avoid this deadlock, separate the inode cluster buffer locking
semantics from the synchronous inode flush semantics, allowing
callers to attempt to lock the buffer but still issue synchronous IO
if they can get the buffer. This requires xfs_iflush() calls that
currently use non-blocking semantics to pass SYNC_TRYLOCK rather
than 0 as the flags parameter.
This allows xfs_reclaim_inode to avoid the deadlock on the buffer
lock and detect the failure so that it can drop the inode ilock and
restart the reclaim attempt on the inode. This allows
xfs_ifree_cluster to obtain the inode lock, mark the inode stale and
release it and hence defuse the deadlock situation. It also has the
pleasant side effect of avoiding IO in xfs_reclaim_inode when it
next tries to reclaim the inode, as it is now marked stale.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/linux-2.6/xfs_super.c | 2 +-
fs/xfs/linux-2.6/xfs_sync.c | 30 +++++++++++++++++++++++++++---
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_inode_item.c | 6 +++---
4 files changed, 32 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 818c4cf..8a70b2a 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1078,7 +1078,7 @@ xfs_fs_write_inode(
error = 0;
goto out_unlock;
}
- error = xfs_iflush(ip, 0);
+ error = xfs_iflush(ip, SYNC_TRYLOCK);
}
out_unlock:
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 6c10f1d..594cd82 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -761,8 +761,10 @@ xfs_reclaim_inode(
struct xfs_perag *pag,
int sync_mode)
{
- int error = 0;
+ int error;
+restart:
+ error = 0;
xfs_ilock(ip, XFS_ILOCK_EXCL);
if (!xfs_iflock_nowait(ip)) {
if (!(sync_mode & SYNC_WAIT))
@@ -788,9 +790,31 @@ xfs_reclaim_inode(
if (xfs_inode_clean(ip))
goto reclaim;
- /* Now we have an inode that needs flushing */
- error = xfs_iflush(ip, sync_mode);
+ /*
+ * Now we have an inode that needs flushing.
+ *
+ * We do a nonblocking flush here even if we are doing a SYNC_WAIT
+ * reclaim as we can deadlock with inode cluster removal.
+ * xfs_ifree_cluster() can lock the inode buffer before it locks the
+ * ip->i_lock, and we are doing the exact opposite here. As a result,
+ * doing a blocking xfs_itobp() to get the cluster buffer will result
+ * in an ABBA deadlock with xfs_ifree_cluster().
+ *
+ * As xfs_ifree_cluster() must gather all inodes that are active in the
+ * cache to mark them stale, if we hit this case we don't actually want
+ * to do IO here - we want the inode marked stale so we can simply
+ * reclaim it. Hence if we get an EAGAIN error on a SYNC_WAIT flush,
+ * just unlock the inode, back off and try again. Hopefully the next
+ * pass through will see the stale flag set on the inode.
+ */
+ error = xfs_iflush(ip, SYNC_TRYLOCK | sync_mode);
if (sync_mode & SYNC_WAIT) {
+ if (error == EAGAIN) {
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+ /* backoff longer than in xfs_ifree_cluster */
+ delay(2);
+ goto restart;
+ }
xfs_iflock(ip);
goto reclaim;
}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index da871f5..742c833 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2835,7 +2835,7 @@ xfs_iflush(
* Get the buffer containing the on-disk inode.
*/
error = xfs_itobp(mp, NULL, ip, &dip, &bp,
- (flags & SYNC_WAIT) ? XBF_LOCK : XBF_TRYLOCK);
+ (flags & SYNC_TRYLOCK) ? XBF_TRYLOCK : XBF_LOCK);
if (error || !bp) {
xfs_ifunlock(ip);
return error;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index fd4f398..46cc401 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -760,11 +760,11 @@ xfs_inode_item_push(
* Push the inode to its backing buffer. This will not remove the
* inode from the AIL - a further push will be required to trigger a
* buffer push. However, this allows all the dirty inodes to be pushed
- * to the buffer before it is pushed to disk. THe buffer IO completion
- * will pull th einode from the AIL, mark it clean and unlock the flush
+ * to the buffer before it is pushed to disk. The buffer IO completion
+ * will pull the inode from the AIL, mark it clean and unlock the flush
* lock.
*/
- (void) xfs_iflush(ip, 0);
+ (void) xfs_iflush(ip, SYNC_TRYLOCK);
xfs_iunlock(ip, XFS_ILOCK_SHARED);
}
--
1.7.2.3
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [PATCH 2/6] xfs: introduce a xfssyncd workqueue
2011-03-10 0:05 [PATCH 0/6] xfs: Reduce OOM kill problems under heavy load V2 Dave Chinner
2011-03-10 0:05 ` [PATCH 1/6] xfs: introduce inode cluster buffer trylocks for xfs_iflush Dave Chinner
@ 2011-03-10 0:05 ` Dave Chinner
2011-03-10 17:34 ` Christoph Hellwig
2011-03-10 0:05 ` [PATCH 3/6] xfs: convert ENOSPC inode flushing to use new syncd workqueue Dave Chinner
` (3 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2011-03-10 0:05 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
All of the work xfssyncd does is background functionality. There is
no need for a thread per filesystem to do this work - it can all be
managed by a global workqueue now that workqueues manage concurrency
effectively.
Introduce a new global xfssyncd workqueue, and convert the periodic
work to use this new functionality. To do this, use a delayed work
construct to schedule the next running of the periodic sync work
for the filesystem. When the sync work is complete, queue a new
delayed work for the next running of the sync work.
For laptop mode, we wait for the sync work to complete, so ensure
that the sync work queuing interface can flush and wait for work to
complete. This enables the workqueue infrastructure to replace the
current sequence number and wakeup mechanism.
Because the sync work does non-trivial amounts of work, mark the
new work queue as CPU intensive.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/linux-2.6/xfs_super.c | 38 +++++++++----------
fs/xfs/linux-2.6/xfs_sync.c | 86 ++++++++++++++++++++---------------------
fs/xfs/linux-2.6/xfs_sync.h | 2 +
fs/xfs/xfs_mount.h | 4 +-
4 files changed, 63 insertions(+), 67 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 8a70b2a..768894b 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1191,22 +1191,12 @@ xfs_fs_sync_fs(
return -error;
if (laptop_mode) {
- int prev_sync_seq = mp->m_sync_seq;
-
/*
* The disk must be active because we're syncing.
* We schedule xfssyncd now (now that the disk is
* active) instead of later (when it might not be).
*/
- wake_up_process(mp->m_sync_task);
- /*
- * We have to wait for the sync iteration to complete.
- * If we don't, the disk activity caused by the sync
- * will come after the sync is completed, and that
- * triggers another sync from laptop mode.
- */
- wait_event(mp->m_wait_single_sync_task,
- mp->m_sync_seq != prev_sync_seq);
+ flush_delayed_work_sync(&mp->m_sync_work);
}
return 0;
@@ -1492,7 +1482,6 @@ xfs_fs_fill_super(
atomic_set(&mp->m_active_trans, 0);
INIT_LIST_HEAD(&mp->m_sync_list);
spin_lock_init(&mp->m_sync_lock);
- init_waitqueue_head(&mp->m_wait_single_sync_task);
mp->m_super = sb;
sb->s_fs_info = mp;
@@ -1819,26 +1808,34 @@ init_xfs_fs(void)
if (error)
goto out_cleanup_procfs;
+ xfs_syncd_wq = alloc_workqueue("xfssyncd", WQ_CPU_INTENSIVE, 8);
+ if (!xfs_syncd_wq) {
+ error = -ENOMEM;
+ goto out_sysctl_unregister;
+ }
+
vfs_initquota();
error = register_filesystem(&xfs_fs_type);
if (error)
- goto out_sysctl_unregister;
+ goto out_destroy_xfs_syncd;
return 0;
- out_sysctl_unregister:
+out_destroy_xfs_syncd:
+ destroy_workqueue(xfs_syncd_wq);
+out_sysctl_unregister:
xfs_sysctl_unregister();
- out_cleanup_procfs:
+out_cleanup_procfs:
xfs_cleanup_procfs();
- out_buf_terminate:
+out_buf_terminate:
xfs_buf_terminate();
- out_filestream_uninit:
+out_filestream_uninit:
xfs_filestream_uninit();
- out_mru_cache_uninit:
+out_mru_cache_uninit:
xfs_mru_cache_uninit();
- out_destroy_zones:
+out_destroy_zones:
xfs_destroy_zones();
- out:
+out:
return error;
}
@@ -1847,6 +1844,7 @@ exit_xfs_fs(void)
{
vfs_exitquota();
unregister_filesystem(&xfs_fs_type);
+ destroy_workqueue(xfs_syncd_wq);
xfs_sysctl_unregister();
xfs_cleanup_procfs();
xfs_buf_terminate();
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 594cd82..91c513b 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -39,6 +39,8 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
+struct workqueue_struct *xfs_syncd_wq; /* sync workqueue */
+
/*
* The inode lookup is done in batches to keep the amount of lock traffic and
* radix tree lookups to a minimum. The batch size is a trade off between
@@ -489,32 +491,6 @@ xfs_flush_inodes(
xfs_log_force(ip->i_mount, XFS_LOG_SYNC);
}
-/*
- * Every sync period we need to unpin all items, reclaim inodes and sync
- * disk quotas. We might need to cover the log to indicate that the
- * filesystem is idle and not frozen.
- */
-STATIC void
-xfs_sync_worker(
- struct xfs_mount *mp,
- void *unused)
-{
- int error;
-
- if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
- /* dgc: errors ignored here */
- if (mp->m_super->s_frozen == SB_UNFROZEN &&
- xfs_log_need_covered(mp))
- error = xfs_fs_log_dummy(mp);
- else
- xfs_log_force(mp, 0);
- xfs_reclaim_inodes(mp, 0);
- error = xfs_qm_sync(mp, SYNC_TRYLOCK);
- }
- mp->m_sync_seq++;
- wake_up(&mp->m_wait_single_sync_task);
-}
-
STATIC int
xfssyncd(
void *arg)
@@ -535,27 +511,12 @@ xfssyncd(
break;
spin_lock(&mp->m_sync_lock);
- /*
- * We can get woken by laptop mode, to do a sync -
- * that's the (only!) case where the list would be
- * empty with time remaining.
- */
- if (!timeleft || list_empty(&mp->m_sync_list)) {
- if (!timeleft)
- timeleft = xfs_syncd_centisecs *
- msecs_to_jiffies(10);
- INIT_LIST_HEAD(&mp->m_sync_work.w_list);
- list_add_tail(&mp->m_sync_work.w_list,
- &mp->m_sync_list);
- }
list_splice_init(&mp->m_sync_list, &tmp);
spin_unlock(&mp->m_sync_lock);
list_for_each_entry_safe(work, n, &tmp, w_list) {
(*work->w_syncer)(mp, work->w_data);
list_del(&work->w_list);
- if (work == &mp->m_sync_work)
- continue;
if (work->w_completion)
complete(work->w_completion);
kmem_free(work);
@@ -565,13 +526,49 @@ xfssyncd(
return 0;
}
+static void
+xfs_syncd_queue_sync(
+ struct xfs_mount *mp)
+{
+ queue_delayed_work(xfs_syncd_wq, &mp->m_sync_work,
+ xfs_syncd_centisecs * msecs_to_jiffies(10));
+}
+
+/*
+ * Every sync period we need to unpin all items, reclaim inodes and sync
+ * disk quotas. We might need to cover the log to indicate that the
+ * filesystem is idle and not frozen.
+ */
+STATIC void
+xfs_sync_worker(
+ struct work_struct *work)
+{
+ struct xfs_mount *mp = container_of(to_delayed_work(work),
+ struct xfs_mount, m_sync_work);
+ int error;
+
+ if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
+ /* dgc: errors ignored here */
+ if (mp->m_super->s_frozen == SB_UNFROZEN &&
+ xfs_log_need_covered(mp))
+ error = xfs_fs_log_dummy(mp);
+ else
+ xfs_log_force(mp, 0);
+ xfs_reclaim_inodes(mp, 0);
+ error = xfs_qm_sync(mp, SYNC_TRYLOCK);
+ }
+
+ /* queue us up again */
+ xfs_syncd_queue_sync(mp);
+}
+
int
xfs_syncd_init(
struct xfs_mount *mp)
{
- mp->m_sync_work.w_syncer = xfs_sync_worker;
- mp->m_sync_work.w_mount = mp;
- mp->m_sync_work.w_completion = NULL;
+ INIT_DELAYED_WORK(&mp->m_sync_work, xfs_sync_worker);
+ xfs_syncd_queue_sync(mp);
+
mp->m_sync_task = kthread_run(xfssyncd, mp, "xfssyncd/%s", mp->m_fsname);
if (IS_ERR(mp->m_sync_task))
return -PTR_ERR(mp->m_sync_task);
@@ -582,6 +579,7 @@ void
xfs_syncd_stop(
struct xfs_mount *mp)
{
+ cancel_delayed_work_sync(&mp->m_sync_work);
kthread_stop(mp->m_sync_task);
}
diff --git a/fs/xfs/linux-2.6/xfs_sync.h b/fs/xfs/linux-2.6/xfs_sync.h
index 32ba662..e3a6ad2 100644
--- a/fs/xfs/linux-2.6/xfs_sync.h
+++ b/fs/xfs/linux-2.6/xfs_sync.h
@@ -32,6 +32,8 @@ typedef struct xfs_sync_work {
#define SYNC_WAIT 0x0001 /* wait for i/o to complete */
#define SYNC_TRYLOCK 0x0002 /* only try to lock inodes */
+extern struct workqueue_struct *xfs_syncd_wq; /* sync workqueue */
+
int xfs_syncd_init(struct xfs_mount *mp);
void xfs_syncd_stop(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a62e897..2c11e62 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -203,12 +203,10 @@ typedef struct xfs_mount {
struct mutex m_icsb_mutex; /* balancer sync lock */
#endif
struct xfs_mru_cache *m_filestream; /* per-mount filestream data */
+ struct delayed_work m_sync_work; /* background sync work */
struct task_struct *m_sync_task; /* generalised sync thread */
- xfs_sync_work_t m_sync_work; /* work item for VFS_SYNC */
struct list_head m_sync_list; /* sync thread work item list */
spinlock_t m_sync_lock; /* work item list lock */
- int m_sync_seq; /* sync thread generation no. */
- wait_queue_head_t m_wait_single_sync_task;
__int64_t m_update_flags; /* sb flags we need to update
on the next remount,rw */
struct shrinker m_inode_shrink; /* inode reclaim shrinker */
--
1.7.2.3
* Re: [PATCH 2/6] xfs: introduce a xfssyncd workqueue
2011-03-10 0:05 ` [PATCH 2/6] xfs: introduce a xfssyncd workqueue Dave Chinner
@ 2011-03-10 17:34 ` Christoph Hellwig
2011-03-18 3:39 ` Dave Chinner
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2011-03-10 17:34 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
> - out_sysctl_unregister:
> +out_destroy_xfs_syncd:
> + destroy_workqueue(xfs_syncd_wq);
> +out_sysctl_unregister:
> xfs_sysctl_unregister();
> - out_cleanup_procfs:
> +out_cleanup_procfs:
> xfs_cleanup_procfs();
> - out_buf_terminate:
> +out_buf_terminate:
> xfs_buf_terminate();
> - out_filestream_uninit:
> +out_filestream_uninit:
> xfs_filestream_uninit();
> - out_mru_cache_uninit:
> +out_mru_cache_uninit:
> xfs_mru_cache_uninit();
> - out_destroy_zones:
> +out_destroy_zones:
What's the point of these random formatting changes?
Otherwise looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 2/6] xfs: introduce a xfssyncd workqueue
2011-03-10 17:34 ` Christoph Hellwig
@ 2011-03-18 3:39 ` Dave Chinner
0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2011-03-18 3:39 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Thu, Mar 10, 2011 at 12:34:05PM -0500, Christoph Hellwig wrote:
> > - out_sysctl_unregister:
> > +out_destroy_xfs_syncd:
> > + destroy_workqueue(xfs_syncd_wq);
> > +out_sysctl_unregister:
> > xfs_sysctl_unregister();
> > - out_cleanup_procfs:
> > +out_cleanup_procfs:
> > xfs_cleanup_procfs();
> > - out_buf_terminate:
> > +out_buf_terminate:
> > xfs_buf_terminate();
> > - out_filestream_uninit:
> > +out_filestream_uninit:
> > xfs_filestream_uninit();
> > - out_mru_cache_uninit:
> > +out_mru_cache_uninit:
> > xfs_mru_cache_uninit();
> > - out_destroy_zones:
> > +out_destroy_zones:
>
> What's the point of these random formatting changes?
>
Removed.
--
Dave Chinner
david@fromorbit.com
* [PATCH 3/6] xfs: convert ENOSPC inode flushing to use new syncd workqueue
2011-03-10 0:05 [PATCH 0/6] xfs: Reduce OOM kill problems under heavy load V2 Dave Chinner
2011-03-10 0:05 ` [PATCH 1/6] xfs: introduce inode cluster buffer trylocks for xfs_iflush Dave Chinner
2011-03-10 0:05 ` [PATCH 2/6] xfs: introduce a xfssyncd workqueue Dave Chinner
@ 2011-03-10 0:05 ` Dave Chinner
2011-03-10 17:35 ` Christoph Hellwig
2011-03-10 0:05 ` [PATCH 4/6] xfs: introduce background inode reclaim work Dave Chinner
` (2 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2011-03-10 0:05 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
One of the problems with the current inode flush at ENOSPC is that we
queue a flush per ENOSPC event, regardless of how many are already
queued. This can result in hundreds of queued flushes, most of
which simply burn CPU scanning and do no real work. This just slows
down allocation at ENOSPC.
We really only need one active flush at a time, and we can easily
implement that via the new xfs_syncd_wq. All we need to do is queue
a flush if one is not already active, then block waiting for the
currently active flush to complete. The result is that we only ever
have a single ENOSPC inode flush active at a time and this greatly
reduces the overhead of ENOSPC processing.
On my 2p test machine, this results in tests exercising ENOSPC
conditions running significantly faster - 042 halves execution time,
083 drops from 60s to 5s, etc - while not introducing test
regressions.
This allows us to remove the old xfssyncd threads and infrastructure
as they are no longer used.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/linux-2.6/xfs_super.c | 2 -
fs/xfs/linux-2.6/xfs_sync.c | 133 +++++++++++------------------------------
fs/xfs/xfs_mount.h | 4 +-
3 files changed, 37 insertions(+), 102 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 768894b..1cbee9c 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1480,8 +1480,6 @@ xfs_fs_fill_super(
spin_lock_init(&mp->m_sb_lock);
mutex_init(&mp->m_growlock);
atomic_set(&mp->m_active_trans, 0);
- INIT_LIST_HEAD(&mp->m_sync_list);
- spin_lock_init(&mp->m_sync_lock);
mp->m_super = sb;
sb->s_fs_info = mp;
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 91c513b..62e1f09 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -433,99 +433,6 @@ xfs_quiesce_attr(
xfs_unmountfs_writesb(mp);
}
-/*
- * Enqueue a work item to be picked up by the vfs xfssyncd thread.
- * Doing this has two advantages:
- * - It saves on stack space, which is tight in certain situations
- * - It can be used (with care) as a mechanism to avoid deadlocks.
- * Flushing while allocating in a full filesystem requires both.
- */
-STATIC void
-xfs_syncd_queue_work(
- struct xfs_mount *mp,
- void *data,
- void (*syncer)(struct xfs_mount *, void *),
- struct completion *completion)
-{
- struct xfs_sync_work *work;
-
- work = kmem_alloc(sizeof(struct xfs_sync_work), KM_SLEEP);
- INIT_LIST_HEAD(&work->w_list);
- work->w_syncer = syncer;
- work->w_data = data;
- work->w_mount = mp;
- work->w_completion = completion;
- spin_lock(&mp->m_sync_lock);
- list_add_tail(&work->w_list, &mp->m_sync_list);
- spin_unlock(&mp->m_sync_lock);
- wake_up_process(mp->m_sync_task);
-}
-
-/*
- * Flush delayed allocate data, attempting to free up reserved space
- * from existing allocations. At this point a new allocation attempt
- * has failed with ENOSPC and we are in the process of scratching our
- * heads, looking about for more room...
- */
-STATIC void
-xfs_flush_inodes_work(
- struct xfs_mount *mp,
- void *arg)
-{
- struct inode *inode = arg;
- xfs_sync_data(mp, SYNC_TRYLOCK);
- xfs_sync_data(mp, SYNC_TRYLOCK | SYNC_WAIT);
- iput(inode);
-}
-
-void
-xfs_flush_inodes(
- xfs_inode_t *ip)
-{
- struct inode *inode = VFS_I(ip);
- DECLARE_COMPLETION_ONSTACK(completion);
-
- igrab(inode);
- xfs_syncd_queue_work(ip->i_mount, inode, xfs_flush_inodes_work, &completion);
- wait_for_completion(&completion);
- xfs_log_force(ip->i_mount, XFS_LOG_SYNC);
-}
-
-STATIC int
-xfssyncd(
- void *arg)
-{
- struct xfs_mount *mp = arg;
- long timeleft;
- xfs_sync_work_t *work, *n;
- LIST_HEAD (tmp);
-
- set_freezable();
- timeleft = xfs_syncd_centisecs * msecs_to_jiffies(10);
- for (;;) {
- if (list_empty(&mp->m_sync_list))
- timeleft = schedule_timeout_interruptible(timeleft);
- /* swsusp */
- try_to_freeze();
- if (kthread_should_stop() && list_empty(&mp->m_sync_list))
- break;
-
- spin_lock(&mp->m_sync_lock);
- list_splice_init(&mp->m_sync_list, &tmp);
- spin_unlock(&mp->m_sync_lock);
-
- list_for_each_entry_safe(work, n, &tmp, w_list) {
- (*work->w_syncer)(mp, work->w_data);
- list_del(&work->w_list);
- if (work->w_completion)
- complete(work->w_completion);
- kmem_free(work);
- }
- }
-
- return 0;
-}
-
static void
xfs_syncd_queue_sync(
struct xfs_mount *mp)
@@ -562,16 +469,48 @@ xfs_sync_worker(
xfs_syncd_queue_sync(mp);
}
+/*
+ * Flush delayed allocate data, attempting to free up reserved space
+ * from existing allocations. At this point a new allocation attempt
+ * has failed with ENOSPC and we are in the process of scratching our
+ * heads, looking about for more room.
+ *
+ * Queue a new data flush if there isn't one already in progress and
+ * wait for completion of the flush. This means that we only ever have one
+ * inode flush in progress no matter how many ENOSPC events are occurring and
+ * so will prevent the system from bogging down due to every concurrent
+ * ENOSPC event scanning all the active inodes in the system for writeback.
+ */
+void
+xfs_flush_inodes(
+ struct xfs_inode *ip)
+{
+ struct xfs_mount *mp = ip->i_mount;
+
+ queue_work(xfs_syncd_wq, &mp->m_flush_work);
+ flush_work_sync(&mp->m_flush_work);
+}
+
+STATIC void
+xfs_flush_worker(
+ struct work_struct *work)
+{
+ struct xfs_mount *mp = container_of(work,
+ struct xfs_mount, m_flush_work);
+
+ xfs_sync_data(mp, SYNC_TRYLOCK);
+ xfs_sync_data(mp, SYNC_TRYLOCK | SYNC_WAIT);
+ xfs_log_force(mp, XFS_LOG_SYNC);
+}
+
int
xfs_syncd_init(
struct xfs_mount *mp)
{
+ INIT_WORK(&mp->m_flush_work, xfs_flush_worker);
INIT_DELAYED_WORK(&mp->m_sync_work, xfs_sync_worker);
xfs_syncd_queue_sync(mp);
- mp->m_sync_task = kthread_run(xfssyncd, mp, "xfssyncd/%s", mp->m_fsname);
- if (IS_ERR(mp->m_sync_task))
- return -PTR_ERR(mp->m_sync_task);
return 0;
}
@@ -580,7 +519,7 @@ xfs_syncd_stop(
struct xfs_mount *mp)
{
cancel_delayed_work_sync(&mp->m_sync_work);
- kthread_stop(mp->m_sync_task);
+ cancel_work_sync(&mp->m_flush_work);
}
void
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 2c11e62..a0ad90e 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -204,9 +204,7 @@ typedef struct xfs_mount {
#endif
struct xfs_mru_cache *m_filestream; /* per-mount filestream data */
struct delayed_work m_sync_work; /* background sync work */
- struct task_struct *m_sync_task; /* generalised sync thread */
- struct list_head m_sync_list; /* sync thread work item list */
- spinlock_t m_sync_lock; /* work item list lock */
+ struct work_struct m_flush_work; /* background inode flush */
__int64_t m_update_flags; /* sb flags we need to update
on the next remount,rw */
struct shrinker m_inode_shrink; /* inode reclaim shrinker */
--
1.7.2.3
* Re: [PATCH 3/6] xfs: convert ENOSPC inode flushing to use new syncd workqueue
2011-03-10 0:05 ` [PATCH 3/6] xfs: convert ENOSPC inode flushing to use new syncd workqueue Dave Chinner
@ 2011-03-10 17:35 ` Christoph Hellwig
2011-03-18 3:39 ` Dave Chinner
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2011-03-10 17:35 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Looks good,
Reviewed-by: Christoph Hellwig <hch@lst.de>
> +STATIC void
> +xfs_flush_worker(
> + struct work_struct *work)
> +{
> + struct xfs_mount *mp = container_of(work,
> + struct xfs_mount, m_flush_work);
> +
> + xfs_sync_data(mp, SYNC_TRYLOCK);
> + xfs_sync_data(mp, SYNC_TRYLOCK | SYNC_WAIT);
> + xfs_log_force(mp, XFS_LOG_SYNC);
No actually new in this patch: but what's the point of the log force
here? xfs_sync_data just did one before returning.
* Re: [PATCH 3/6] xfs: convert ENOSPC inode flushing to use new syncd workqueue
2011-03-10 17:35 ` Christoph Hellwig
@ 2011-03-18 3:39 ` Dave Chinner
0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2011-03-18 3:39 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Thu, Mar 10, 2011 at 12:35:49PM -0500, Christoph Hellwig wrote:
> Looks good,
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> > +STATIC void
> > +xfs_flush_worker(
> > + struct work_struct *work)
> > +{
> > + struct xfs_mount *mp = container_of(work,
> > + struct xfs_mount, m_flush_work);
> > +
> > + xfs_sync_data(mp, SYNC_TRYLOCK);
> > + xfs_sync_data(mp, SYNC_TRYLOCK | SYNC_WAIT);
> > + xfs_log_force(mp, XFS_LOG_SYNC);
>
> No actually new in this patch: but what's the point of the log force
> here? xfs_sync_data just did one before returning.
Historical, I think. Removed.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* [PATCH 4/6] xfs: introduce background inode reclaim work
2011-03-10 0:05 [PATCH 0/6] xfs: Reduce OOM kill problems under heavy load V2 Dave Chinner
` (2 preceding siblings ...)
2011-03-10 0:05 ` [PATCH 3/6] xfs: convert ENOSPC inode flushing to use new syncd workqueue Dave Chinner
@ 2011-03-10 0:05 ` Dave Chinner
2011-03-10 17:40 ` Christoph Hellwig
2011-03-10 0:05 ` [PATCH 5/6] xfs: convert the xfsaild threads to a workqueue Dave Chinner
2011-03-10 0:05 ` [PATCH 6/6] xfs: push the AIL from memory reclaim and periodic sync Dave Chinner
5 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2011-03-10 0:05 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Background inode reclaim needs to run more frequently than the XFS
syncd work is run, as 30s is too long between optimal reclaim runs.
Add a new periodic work item to the xfs syncd workqueue to run a
fast, non-blocking inode reclaim scan.
To make memory-reclaim-based inode reclaim throttle to inode
cleaning but still reclaim inodes efficiently, make it kick the
background inode reclaim so that when we are low on memory we are
trying to reclaim inodes as efficiently as possible. To contrast
this, make the shrinker pass do synchronous inode reclaim so that it
blocks on inodes under IO. This means that the shrinker will reclaim
inodes rather than just skipping over them, but it does not
adversely affect the rate of reclaim because most dirty inodes are
already under IO due to the background reclaim work the shrinker
kicked.
These two modifications solve one of the two OOM killer invocations
Chris Mason reported recently when running a stress testing script.
The particular workload trigger for the OOM killer invocation is
where there are more threads than CPUs all unlinking files in an
extremely memory constrained environment. Unlike other solutions,
this one does not impact performance when memory is not constrained
or when the number of concurrent threads operating is less than or
equal to the number of CPUs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/linux-2.6/xfs_sync.c | 56 +++++++++++++++++++++++++++++++++++++++++-
fs/xfs/xfs_mount.h | 1 +
2 files changed, 55 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 62e1f09..3f39d83 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -470,6 +470,44 @@ xfs_sync_worker(
}
/*
+ * Queue a new inode reclaim pass if there isn't one already in progress.
+ * Wait for completion of the flush if necessary.
+ */
+static void
+xfs_syncd_queue_reclaim(
+ struct xfs_mount *mp)
+{
+ queue_delayed_work(xfs_syncd_wq, &mp->m_reclaim_work,
+ xfs_syncd_centisecs / 5 * msecs_to_jiffies(10));
+}
+
+/*
+ * This is a fast pass over the inode cache to try to get reclaim moving on as
+ * many inodes as possible in a short period of time. It kicks itself every few
+ * seconds, as well as being kicked by the inode cache shrinker when memory
+ * goes low.
+ */
+STATIC void
+xfs_reclaim_worker(
+ struct work_struct *work)
+{
+ struct xfs_mount *mp = container_of(to_delayed_work(work),
+ struct xfs_mount, m_reclaim_work);
+
+ /* first unpin all the dirty and stale inodes. */
+ xfs_log_force(mp, XFS_LOG_SYNC);
+
+ /*
+ * now scan as quickly as possible, not getting hung up on locked
+ * inodes or those that are already flushing.
+ */
+ xfs_reclaim_inodes(mp, SYNC_TRYLOCK);
+
+ /* queue us up again */
+ xfs_syncd_queue_reclaim(mp);
+}
+
+/*
* Flush delayed allocate data, attempting to free up reserved space
* from existing allocations. At this point a new allocation attempt
* has failed with ENOSPC and we are in the process of scratching our
@@ -509,7 +547,10 @@ xfs_syncd_init(
{
INIT_WORK(&mp->m_flush_work, xfs_flush_worker);
INIT_DELAYED_WORK(&mp->m_sync_work, xfs_sync_worker);
+ INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
+
xfs_syncd_queue_sync(mp);
+ xfs_syncd_queue_reclaim(mp);
return 0;
}
@@ -519,6 +560,7 @@ xfs_syncd_stop(
struct xfs_mount *mp)
{
cancel_delayed_work_sync(&mp->m_sync_work);
+ cancel_delayed_work_sync(&mp->m_reclaim_work);
cancel_work_sync(&mp->m_flush_work);
}
@@ -954,7 +996,13 @@ xfs_reclaim_inodes(
}
/*
- * Shrinker infrastructure.
+ * Inode cache shrinker.
+ *
+ * When called we make sure that there is a background (fast) inode reclaim in
+ * progress, while we will throttle the speed of reclaim via doiing synchronous
+ * reclaim of inodes. That means if we come across dirty inodes, we wait for
+ * them to be cleaned, which we hope will not be very long due to the
+ * background walker having already kicked the IO off on those dirty inodes.
*/
static int
xfs_reclaim_inode_shrink(
@@ -969,10 +1017,14 @@ xfs_reclaim_inode_shrink(
mp = container_of(shrink, struct xfs_mount, m_inode_shrink);
if (nr_to_scan) {
+ /* kick background reclaimer */
+ xfs_syncd_queue_reclaim(mp);
+
if (!(gfp_mask & __GFP_FS))
return -1;
- xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
+ xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT,
+ &nr_to_scan);
/* terminate if we don't exhaust the scan */
if (nr_to_scan > 0)
return -1;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a0ad90e..19af0ab 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -204,6 +204,7 @@ typedef struct xfs_mount {
#endif
struct xfs_mru_cache *m_filestream; /* per-mount filestream data */
struct delayed_work m_sync_work; /* background sync work */
+ struct delayed_work m_reclaim_work; /* background inode reclaim */
struct work_struct m_flush_work; /* background inode flush */
__int64_t m_update_flags; /* sb flags we need to update
on the next remount,rw */
--
1.7.2.3
* Re: [PATCH 4/6] xfs: introduce background inode reclaim work
2011-03-10 0:05 ` [PATCH 4/6] xfs: introduce background inode reclaim work Dave Chinner
@ 2011-03-10 17:40 ` Christoph Hellwig
2011-03-18 4:00 ` Dave Chinner
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2011-03-10 17:40 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Why do we still keep the inode reclaim in the syncer work? If we
already have this one doing it I don't think we need it there as well.
> /*
> + * Queue a new inode reclaim pass if there isn't one already in progress.
> + * Wait for completion of the flush if necessary.
No, it doesn't wait ever.
> + */
> +static void
> +xfs_syncd_queue_reclaim(
> + struct xfs_mount *mp)
> +{
> + queue_delayed_work(xfs_syncd_wq, &mp->m_reclaim_work,
> + xfs_syncd_centisecs / 5 * msecs_to_jiffies(10));
What explanation is there for the magic 5? E.g. why do we need to run
it exactly 5 times as often as the normal sync work? Should it have its
own tunable? And isn't every 6 seconds by default a little often on
systems trying to save power, especially if there aren't any inodes to
reclaim? Should we instead trigger this work off having reclaimable
inodes tagged in the radix tree?
> + /* first unpin all the dirty and stale inodes. */
> + xfs_log_force(mp, XFS_LOG_SYNC);
So we force out the log every 6 seconds. That's a lot more often than
most other filesystems and might have an adverse performance impact.
* Re: [PATCH 4/6] xfs: introduce background inode reclaim work
2011-03-10 17:40 ` Christoph Hellwig
@ 2011-03-18 4:00 ` Dave Chinner
0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2011-03-18 4:00 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Thu, Mar 10, 2011 at 12:40:54PM -0500, Christoph Hellwig wrote:
> Why do we still keep the inode reclaim in the syncer work? If we
> already have this one doing it I don't think we need it there as well.
Ok. Removed.
> > /*
> > + * Queue a new inode reclaim pass if there isn't one already in progress.
> > + * Wait for completion of the flush if necessary.
>
> No, it doesn't wait ever.
Removed.
> > + */
> > +static void
> > +xfs_syncd_queue_reclaim(
> > + struct xfs_mount *mp)
> > +{
> > + queue_delayed_work(xfs_syncd_wq, &mp->m_reclaim_work,
> > + xfs_syncd_centisecs / 5 * msecs_to_jiffies(10));
>
> What explanation is there for the magic 5? E.g. why do we need to run
> it exactly 5 times as often as the normal sync work?
It doesn't have to be exactly 5x more frequent, just that it needs
to run quite a bit more often than the normal sync work. Ten times
more often seems like overkill and a lot of overhead given the
scanning that reclaim does, while two times more often isn't
sufficient to avoid a significant build up of dirty reclaimable
inodes that need to be written before they can be reclaimed.
> Should it have its
> own tunable?
Perhaps. I'm not convinced it is necessary, though.
> And isn't every 6 seconds by default a little often on
> systems trying to save power, especially if there aren't any inodes to
> reclaim? Should we instead trigger this work off having reclaimable
> inodes tagged in the radix tree?
Yes, probably should. I'll see if I can do that easily.
> > + /* first unpin all the dirty and stale inodes. */
> > + xfs_log_force(mp, XFS_LOG_SYNC);
>
> So we force out the log every 6 seconds. That's a lot more often than
> most other filesystems and might have an adverse performance impact.
I'll remove that and leave it for the sync inode reclaim to force
out the log...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* [PATCH 5/6] xfs: convert the xfsaild threads to a workqueue
2011-03-10 0:05 [PATCH 0/6] xfs: Reduce OOM kill problems under heavy load V2 Dave Chinner
` (3 preceding siblings ...)
2011-03-10 0:05 ` [PATCH 4/6] xfs: introduce background inode reclaim work Dave Chinner
@ 2011-03-10 0:05 ` Dave Chinner
2011-03-10 17:48 ` Christoph Hellwig
2011-03-10 0:05 ` [PATCH 6/6] xfs: push the AIL from memory reclaim and periodic sync Dave Chinner
5 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2011-03-10 0:05 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Similar to the xfssyncd, the per-filesystem xfsaild threads can be
converted to a global workqueue and run periodically as delayed
work items. This makes sense for the AIL pushing because it uses
variable timeouts depending on the work that needs to be done.
By removing the xfsaild, we simplify the AIL pushing code and
remove the need to spread the code to implement the threading
and pushing across multiple files.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/linux-2.6/xfs_super.c | 109 ++++++++++-------------------------
fs/xfs/xfs_trans_ail.c | 130 ++++++++++++++++++++++++------------------
fs/xfs/xfs_trans_priv.h | 11 ++--
3 files changed, 111 insertions(+), 139 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 1cbee9c..b0bcfb3 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -816,75 +816,6 @@ xfs_setup_devices(
return 0;
}
-/*
- * XFS AIL push thread support
- */
-void
-xfsaild_wakeup(
- struct xfs_ail *ailp,
- xfs_lsn_t threshold_lsn)
-{
- /* only ever move the target forwards */
- if (XFS_LSN_CMP(threshold_lsn, ailp->xa_target) > 0) {
- ailp->xa_target = threshold_lsn;
- wake_up_process(ailp->xa_task);
- }
-}
-
-STATIC int
-xfsaild(
- void *data)
-{
- struct xfs_ail *ailp = data;
- xfs_lsn_t last_pushed_lsn = 0;
- long tout = 0; /* milliseconds */
-
- while (!kthread_should_stop()) {
- /*
- * for short sleeps indicating congestion, don't allow us to
- * get woken early. Otherwise all we do is bang on the AIL lock
- * without making progress.
- */
- if (tout && tout <= 20)
- __set_current_state(TASK_KILLABLE);
- else
- __set_current_state(TASK_INTERRUPTIBLE);
- schedule_timeout(tout ?
- msecs_to_jiffies(tout) : MAX_SCHEDULE_TIMEOUT);
-
- /* swsusp */
- try_to_freeze();
-
- ASSERT(ailp->xa_mount->m_log);
- if (XFS_FORCED_SHUTDOWN(ailp->xa_mount))
- continue;
-
- tout = xfsaild_push(ailp, &last_pushed_lsn);
- }
-
- return 0;
-} /* xfsaild */
-
-int
-xfsaild_start(
- struct xfs_ail *ailp)
-{
- ailp->xa_target = 0;
- ailp->xa_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
- ailp->xa_mount->m_fsname);
- if (IS_ERR(ailp->xa_task))
- return -PTR_ERR(ailp->xa_task);
- return 0;
-}
-
-void
-xfsaild_stop(
- struct xfs_ail *ailp)
-{
- kthread_stop(ailp->xa_task);
-}
-
-
/* Catch misguided souls that try to use this interface on XFS */
STATIC struct inode *
xfs_fs_alloc_inode(
@@ -1772,6 +1703,32 @@ xfs_destroy_zones(void)
}
STATIC int __init
+xfs_init_workqueues(void)
+{
+ xfs_syncd_wq = alloc_workqueue("xfssyncd", WQ_CPU_INTENSIVE, 8);
+ if (!xfs_syncd_wq)
+ goto out;
+
+ xfs_ail_wq = alloc_workqueue("xfsail", WQ_CPU_INTENSIVE, 8);
+ if (!xfs_ail_wq)
+ goto out_destroy_syncd;
+
+ return 0;
+
+out_destroy_syncd:
+ destroy_workqueue(xfs_syncd_wq);
+out:
+ return -ENOMEM;
+}
+
+STATIC void __exit
+xfs_destroy_workqueues(void)
+{
+ destroy_workqueue(xfs_ail_wq);
+ destroy_workqueue(xfs_syncd_wq);
+}
+
+STATIC int __init
init_xfs_fs(void)
{
int error;
@@ -1806,21 +1763,19 @@ init_xfs_fs(void)
if (error)
goto out_cleanup_procfs;
- xfs_syncd_wq = alloc_workqueue("xfssyncd", WQ_CPU_INTENSIVE, 8);
- if (!xfs_syncd_wq) {
- error = -ENOMEM;
+ error = xfs_init_workqueues();
+ if (error)
goto out_sysctl_unregister;
- }
vfs_initquota();
error = register_filesystem(&xfs_fs_type);
if (error)
- goto out_destroy_xfs_syncd;
+ goto out_destroy_wq;
return 0;
-out_destroy_xfs_syncd:
- destroy_workqueue(xfs_syncd_wq);
+out_destroy_wq:
+ xfs_destroy_workqueues();
out_sysctl_unregister:
xfs_sysctl_unregister();
out_cleanup_procfs:
@@ -1842,7 +1797,7 @@ exit_xfs_fs(void)
{
vfs_exitquota();
unregister_filesystem(&xfs_fs_type);
- destroy_workqueue(xfs_syncd_wq);
+ xfs_destroy_workqueues();
xfs_sysctl_unregister();
xfs_cleanup_procfs();
xfs_buf_terminate();
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 12aff95..8b68679 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -28,6 +28,8 @@
#include "xfs_trans_priv.h"
#include "xfs_error.h"
+struct workqueue_struct *xfs_ail_wq; /* AIL workqueue */
+
STATIC void xfs_ail_splice(struct xfs_ail *, struct list_head *, xfs_lsn_t);
STATIC void xfs_ail_delete(struct xfs_ail *, xfs_log_item_t *);
STATIC xfs_log_item_t * xfs_ail_min(struct xfs_ail *);
@@ -69,36 +71,6 @@ xfs_trans_ail_tail(
}
/*
- * xfs_trans_push_ail
- *
- * This routine is called to move the tail of the AIL forward. It does this by
- * trying to flush items in the AIL whose lsns are below the given
- * threshold_lsn.
- *
- * the push is run asynchronously in a separate thread, so we return the tail
- * of the log right now instead of the tail after the push. This means we will
- * either continue right away, or we will sleep waiting on the async thread to
- * do its work.
- *
- * We do this unlocked - we only need to know whether there is anything in the
- * AIL at the time we are called. We don't need to access the contents of
- * any of the objects, so the lock is not needed.
- */
-void
-xfs_trans_ail_push(
- struct xfs_ail *ailp,
- xfs_lsn_t threshold_lsn)
-{
- xfs_log_item_t *lip;
-
- lip = xfs_ail_min(ailp);
- if (lip && !XFS_FORCED_SHUTDOWN(ailp->xa_mount)) {
- if (XFS_LSN_CMP(threshold_lsn, ailp->xa_target) > 0)
- xfsaild_wakeup(ailp, threshold_lsn);
- }
-}
-
-/*
* AIL traversal cursor initialisation.
*
* The cursor keeps track of where our current traversal is up
@@ -236,16 +208,16 @@ out:
}
/*
- * xfsaild_push does the work of pushing on the AIL. Returning a timeout of
- * zero indicates that the caller should sleep until woken.
+ * xfs_ail_worker does the work of pushing on the AIL. It will requeue itself
+ * to run at a later time if there is more work to do to complete the push.
*/
-long
-xfsaild_push(
- struct xfs_ail *ailp,
- xfs_lsn_t *last_lsn)
+STATIC void
+xfs_ail_worker(
+ struct work_struct *work)
{
+ struct xfs_ail *ailp = container_of(to_delayed_work(work),
+ struct xfs_ail, xa_work);
long tout = 0;
- xfs_lsn_t last_pushed_lsn = *last_lsn;
xfs_lsn_t target = ailp->xa_target;
xfs_lsn_t lsn;
xfs_log_item_t *lip;
@@ -256,15 +228,15 @@ xfsaild_push(
spin_lock(&ailp->xa_lock);
xfs_trans_ail_cursor_init(ailp, cur);
- lip = xfs_trans_ail_cursor_first(ailp, cur, *last_lsn);
+ lip = xfs_trans_ail_cursor_first(ailp, cur, ailp->xa_last_pushed_lsn);
if (!lip || XFS_FORCED_SHUTDOWN(mp)) {
/*
* AIL is empty or our push has reached the end.
*/
xfs_trans_ail_cursor_done(ailp, cur);
spin_unlock(&ailp->xa_lock);
- *last_lsn = 0;
- return tout;
+ ailp->xa_last_pushed_lsn = 0;
+ return;
}
XFS_STATS_INC(xs_push_ail);
@@ -301,13 +273,13 @@ xfsaild_push(
case XFS_ITEM_SUCCESS:
XFS_STATS_INC(xs_push_ail_success);
IOP_PUSH(lip);
- last_pushed_lsn = lsn;
+ ailp->xa_last_pushed_lsn = lsn;
break;
case XFS_ITEM_PUSHBUF:
XFS_STATS_INC(xs_push_ail_pushbuf);
IOP_PUSHBUF(lip);
- last_pushed_lsn = lsn;
+ ailp->xa_last_pushed_lsn = lsn;
push_xfsbufd = 1;
break;
@@ -319,7 +291,7 @@ xfsaild_push(
case XFS_ITEM_LOCKED:
XFS_STATS_INC(xs_push_ail_locked);
- last_pushed_lsn = lsn;
+ ailp->xa_last_pushed_lsn = lsn;
stuck++;
break;
@@ -376,7 +348,7 @@ xfsaild_push(
if (!count) {
/* We're past our target or empty, so idle */
- last_pushed_lsn = 0;
+ ailp->xa_last_pushed_lsn = 0;
} else if (XFS_LSN_CMP(lsn, target) >= 0) {
/*
* We reached the target so wait a bit longer for I/O to
@@ -384,7 +356,7 @@ xfsaild_push(
* start the next scan from the start of the AIL.
*/
tout = 50;
- last_pushed_lsn = 0;
+ ailp->xa_last_pushed_lsn = 0;
} else if ((stuck * 100) / count > 90) {
/*
* Either there is a lot of contention on the AIL or we
@@ -400,10 +372,63 @@ xfsaild_push(
/* more to do, but wait a short while before continuing */
tout = 10;
}
- *last_lsn = last_pushed_lsn;
- return tout;
+
+ /*
+ * If there is more to do, requeue us. Otherwise the next push will
+ * start us again.
+ */
+ if (tout)
+ queue_delayed_work(xfs_syncd_wq, &ailp->xa_work, tout);
}
+/*
+ * Kick the AIL workqueue into action.
+ *
+ * If we already have work in progress, we do not queue new work, simply move
+ * the target forwards. Otherwise queue for the given amount of time in the
+ * future.
+ */
+static void
+xfs_ail_push_queue(
+ struct xfs_ail *ailp,
+ xfs_lsn_t threshold_lsn,
+ int tout)
+{
+ if (XFS_LSN_CMP(threshold_lsn, ailp->xa_target) > 0) {
+ ailp->xa_target = threshold_lsn;
+ queue_delayed_work(xfs_syncd_wq, &ailp->xa_work, tout);
+ }
+}
+
+/*
+ * xfs_trans_push_ail
+ *
+ * This routine is called to move the tail of the AIL forward. It does this by
+ * trying to flush items in the AIL whose lsns are below the given
+ * threshold_lsn.
+ *
+ * the push is run asynchronously in a workqueue, so we return the tail
+ * of the log right now instead of the tail after the push. This means we will
+ * either continue right away, or we will sleep waiting on the async thread to
+ * do its work. We don't want the workqueue to interrupt any backoff it is
+ * currently engaged in, so start the flush in a short while rather than
+ * immediately.
+ *
+ * We do this unlocked - we only need to know whether there is anything in the
+ * AIL at the time we are called. We don't need to access the contents of
+ * any of the objects, so the lock is not needed.
+ */
+void
+xfs_trans_ail_push(
+ struct xfs_ail *ailp,
+ xfs_lsn_t threshold_lsn)
+{
+ xfs_log_item_t *lip;
+
+ lip = xfs_ail_min(ailp);
+ if (lip && !XFS_FORCED_SHUTDOWN(ailp->xa_mount))
+ xfs_ail_push_queue(ailp, threshold_lsn, 1);
+}
/*
* This is to be called when an item is unlocked that may have
@@ -615,7 +640,6 @@ xfs_trans_ail_init(
xfs_mount_t *mp)
{
struct xfs_ail *ailp;
- int error;
ailp = kmem_zalloc(sizeof(struct xfs_ail), KM_MAYFAIL);
if (!ailp)
@@ -624,15 +648,9 @@ xfs_trans_ail_init(
ailp->xa_mount = mp;
INIT_LIST_HEAD(&ailp->xa_ail);
spin_lock_init(&ailp->xa_lock);
- error = xfsaild_start(ailp);
- if (error)
- goto out_free_ailp;
+ INIT_DELAYED_WORK(&ailp->xa_work, xfs_ail_worker);
mp->m_ail = ailp;
return 0;
-
-out_free_ailp:
- kmem_free(ailp);
- return error;
}
void
@@ -641,7 +659,7 @@ xfs_trans_ail_destroy(
{
struct xfs_ail *ailp = mp->m_ail;
- xfsaild_stop(ailp);
+ cancel_delayed_work_sync(&ailp->xa_work);
kmem_free(ailp);
}
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 35162c2..2410da4 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -66,15 +66,19 @@ struct xfs_ail {
struct xfs_mount *xa_mount;
struct list_head xa_ail;
uint xa_gen;
- struct task_struct *xa_task;
xfs_lsn_t xa_target;
struct xfs_ail_cursor xa_cursors;
spinlock_t xa_lock;
+ struct delayed_work xa_work;
+ xfs_lsn_t xa_last_pushed_lsn;
};
/*
* From xfs_trans_ail.c
*/
+
+extern struct workqueue_struct *xfs_ail_wq; /* AIL workqueue */
+
void xfs_trans_ail_update_bulk(struct xfs_ail *ailp,
struct xfs_log_item **log_items, int nr_items,
xfs_lsn_t lsn) __releases(ailp->xa_lock);
@@ -112,11 +116,6 @@ struct xfs_log_item *xfs_trans_ail_cursor_next(struct xfs_ail *ailp,
void xfs_trans_ail_cursor_done(struct xfs_ail *ailp,
struct xfs_ail_cursor *cur);
-long xfsaild_push(struct xfs_ail *, xfs_lsn_t *);
-void xfsaild_wakeup(struct xfs_ail *, xfs_lsn_t);
-int xfsaild_start(struct xfs_ail *);
-void xfsaild_stop(struct xfs_ail *);
-
#if BITS_PER_LONG != 64
static inline void
xfs_trans_ail_copy_lsn(
--
1.7.2.3
* Re: [PATCH 5/6] xfs: convert the xfsaild threads to a workqueue
2011-03-10 0:05 ` [PATCH 5/6] xfs: convert the xfsaild threads to a workqueue Dave Chinner
@ 2011-03-10 17:48 ` Christoph Hellwig
2011-03-18 4:06 ` Dave Chinner
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2011-03-10 17:48 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
> STATIC int __init
> +xfs_init_workqueues(void)
> +STATIC void __exit
> +xfs_destroy_workqueues(void)
I don't think these helpers are overly useful.
> + xfs_ail_wq = alloc_workqueue("xfsail", WQ_CPU_INTENSIVE, 8);
> +}
> +static void
> +xfs_ail_push_queue(
> + struct xfs_ail *ailp,
> + xfs_lsn_t threshold_lsn,
> + int tout)
> +{
> + if (XFS_LSN_CMP(threshold_lsn, ailp->xa_target) > 0) {
> + ailp->xa_target = threshold_lsn;
> + queue_delayed_work(xfs_syncd_wq, &ailp->xa_work, tout);
tout is always one in the only caller and thus doesn't need to be
passed. But I think you really want a timeout of 0 here to queue it up
ASAP (it translates to a direct queue_work() call internally).
Also this function could simply be merged into its only and relatively
simple caller.
* Re: [PATCH 5/6] xfs: convert the xfsaild threads to a workqueue
2011-03-10 17:48 ` Christoph Hellwig
@ 2011-03-18 4:06 ` Dave Chinner
2011-03-19 13:45 ` Christoph Hellwig
0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2011-03-18 4:06 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Thu, Mar 10, 2011 at 12:48:18PM -0500, Christoph Hellwig wrote:
> > STATIC int __init
> > +xfs_init_workqueues(void)
>
> > +STATIC void __exit
> > +xfs_destroy_workqueues(void)
>
> I don't think these helpers are overly useful.
I'm thinking of adding a few more workqueues, so I thought I'd split
them out like zone initialisation at the outset....
>
> > + xfs_ail_wq = alloc_workqueue("xfsail", WQ_CPU_INTENSIVE, 8);
>
> > +}
>
>
> > +static void
> > +xfs_ail_push_queue(
> > + struct xfs_ail *ailp,
> > + xfs_lsn_t threshold_lsn,
> > + int tout)
> > +{
> > + if (XFS_LSN_CMP(threshold_lsn, ailp->xa_target) > 0) {
> > + ailp->xa_target = threshold_lsn;
> > + queue_delayed_work(xfs_syncd_wq, &ailp->xa_work, tout);
>
> tout is always one in the only caller and thus doesn't need to be
> passed. But I think you really want a timeout of 0 here to queue it up
> ASAP (it translates to a direct queue_work() call internally).
>
> Also this function could simply be merged into its only and relatively
> simple caller.
It gets used by a second caller in the next patch that uses a
timeout of zero. The idea of adding a delay to a normal push is to
rate limit the number of times we do work so we always work on
batches rather than a few items at a time in multiple executions of
the work.
I'll see if it's simpler to just do this work directly in the
callers, though.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 5/6] xfs: convert the xfsaild threads to a workqueue
2011-03-18 4:06 ` Dave Chinner
@ 2011-03-19 13:45 ` Christoph Hellwig
2011-04-03 0:38 ` Dave Chinner
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2011-03-19 13:45 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, xfs
On Fri, Mar 18, 2011 at 03:06:48PM +1100, Dave Chinner wrote:
> It gets used by a second caller in the next patch that uses a
> timeout of zero. The idea of adding a delay to a normal push is to
> rate limit the number of times we do work so we always work on
> batches rather than a few items at a time in multiple executions of
> the work.
>
> I'll see if it's simpler to just do this work directly in the
> callers, though.
I don't think hiding this delay (uncommented) in the workqueue use is
a good idea. xlog_grant_push_ail has all the logic about when to push
the AIL, so any batching should be grouped with that logic, and
documented there. In fact it already has some comments stating that
a min/max watermark scheme would be useful.
* Re: [PATCH 5/6] xfs: convert the xfsaild threads to a workqueue
2011-03-19 13:45 ` Christoph Hellwig
@ 2011-04-03 0:38 ` Dave Chinner
2011-04-03 10:59 ` Christoph Hellwig
0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2011-04-03 0:38 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Sat, Mar 19, 2011 at 09:45:00AM -0400, Christoph Hellwig wrote:
> On Fri, Mar 18, 2011 at 03:06:48PM +1100, Dave Chinner wrote:
> > It gets used by a second caller in the next patch that uses a
> > timeout of zero. The idea of adding a delay to a normal push is to
> > rate limit the number of times we do work so we always work on
> > batches rather a few items at a time in multiple executions of the
> > work.
> >
> > I'll see if it's simpler to just do this work directly in teh
> > callers, though.
>
> I don't think hiding this delay (uncommented) in the workqueue use is
> a good idea.
FWIW, we already have an implicit delay for frequent callers when
the AIL is busy - the uninterruptible sleep for sleeps of <= 20ms.
That was implemented specifically to rate-limit wakeups while the
xfsaild was busy pushing. This is essentially a different
implementation of the same mechanism.
> xlog_grant_push_ail has all the logic about when to push
> the AIL, so any batching should be grouped with that logic, and
> documented there. In fact it already has some comments stating that
> a min/max watermark scheme would be useful.
Yes, it does, but that's a much bigger change that has some
potentially nasty problems, like ensuring the watermarks are always a
sane distance apart, which is difficult to do on small logs where a
single transaction reservation can easily be larger than 10% of the
log. Hence watermarks are a much harder change to validate and tune
compared to a simple push wakeup rate-limit...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 5/6] xfs: convert the xfsaild threads to a workqueue
2011-04-03 0:38 ` Dave Chinner
@ 2011-04-03 10:59 ` Christoph Hellwig
2011-04-04 2:11 ` Dave Chinner
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2011-04-03 10:59 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, xfs
On Sun, Apr 03, 2011 at 10:38:47AM +1000, Dave Chinner wrote:
> FWIW, we already have an implicit delay for frequent callers when
> the AIL is busy - the uninterruptible sleep for sleeps of <= 20ms.
> That was implemented specifically to rate-limit wakeups while the
> xfsaild was busy pushing. This is essentially a different
> implementation of the same mechanism.
In that case maybe the minimum delay of the delayed work should stay
at 20ms?
* Re: [PATCH 5/6] xfs: convert the xfsaild threads to a workqueue
2011-04-03 10:59 ` Christoph Hellwig
@ 2011-04-04 2:11 ` Dave Chinner
0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2011-04-04 2:11 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Sun, Apr 03, 2011 at 06:59:57AM -0400, Christoph Hellwig wrote:
> On Sun, Apr 03, 2011 at 10:38:47AM +1000, Dave Chinner wrote:
> > FWIW, we already have an implicit delay for frequent callers when
> > the AIL is busy - the uninterruptible sleep for sleeps of <= 20ms.
> > That was implemented specifically to rate-limit wakeups while the
> > xfsaild was busy pushing. This is essentially a different
> > implementation of the same mechanism.
>
> In that case maybe the minimum delay of the delayed work should stay
> at 20ms?
I've been looking at this a little more - I don't think the
work_pending() check is sufficient. The pending bit is cleared
before the work function is called, so while the work is executing
we can queue up another work. That means when the work completes and
decides it needs to back off for a short while, it won't back off
because new work has been queued and the pending bit has been
set.
Hence I think this needs an external "work running" bit to be set so
that the backoff works as expected. That is, when pushing the tail
we need to do:
ailp->xa_target = threshold_lsn;
if (!test_and_set_bit(ailp->pushing))
queue_delayed_work(work, 0);
When the AIL work function needs to back off for a certain timeout,
it just does:
queue_delayed_work(work, tout);
and when it completes the push, it does:
clear_bit(ailp->pushing);
So the next tail push requeues the work again. That gets around the
need for implicit wakeup rate limiting, allows tail pushing to keep
moving the target forwards while the work is running, and doesn't
allow tail pushing to impact on a running push.
I'll modify it to do this - it should also simplify the code.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* [PATCH 6/6] xfs: push the AIL from memory reclaim and periodic sync
2011-03-10 0:05 [PATCH 0/6] xfs: Reduce OOM kill problems under heavy load V2 Dave Chinner
` (4 preceding siblings ...)
2011-03-10 0:05 ` [PATCH 5/6] xfs: convert the xfsaild threads to a workqueue Dave Chinner
@ 2011-03-10 0:05 ` Dave Chinner
2011-03-10 21:31 ` Christoph Hellwig
5 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2011-03-10 0:05 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
When we are short on memory, we want to expedite the cleaning of
dirty objects. Hence when we run short on memory, we need to kick
the AIL flushing into action to clean as many dirty objects as
quickly as possible. To implement this, sample the lsn of the log
item at the head of the AIL and use that as the push target for the
AIL flush.
Further, we keep dirty items in the AIL that are not tracked any
other way, so we can get objects sitting in the AIL that don't get
written back until the AIL is pushed. Hence to get the
filesystem to the idle state, we might need to push the AIL to flush
out any remaining dirty objects sitting in the AIL. This requires
the same push mechanism as the reclaim push.
This patch rearranges the location of functions in xfs_trans_ail.c
to remove the need for forward declarations of those functions and
new functions added in this commit.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/linux-2.6/xfs_sync.c | 7 ++-
fs/xfs/xfs_trans_ail.c | 206 +++++++++++++++++++++++++------------------
fs/xfs/xfs_trans_priv.h | 1 +
3 files changed, 129 insertions(+), 85 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 3f39d83..7ab79671 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -22,6 +22,7 @@
#include "xfs_log.h"
#include "xfs_inum.h"
#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
#include "xfs_sb.h"
#include "xfs_ag.h"
#include "xfs_mount.h"
@@ -463,6 +464,9 @@ xfs_sync_worker(
xfs_log_force(mp, 0);
xfs_reclaim_inodes(mp, 0);
error = xfs_qm_sync(mp, SYNC_TRYLOCK);
+
+ /* start pushing all the metadata that is currently dirty */
+ xfs_trans_ail_push_all(mp->m_ail);
}
/* queue us up again */
@@ -1017,8 +1021,9 @@ xfs_reclaim_inode_shrink(
mp = container_of(shrink, struct xfs_mount, m_inode_shrink);
if (nr_to_scan) {
- /* kick background reclaimer */
+ /* kick background reclaimer and push the AIL */
xfs_syncd_queue_reclaim(mp);
+ xfs_trans_ail_push_all(mp->m_ail);
if (!(gfp_mask & __GFP_FS))
return -1;
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 8b68679..62da328 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -30,17 +30,55 @@
struct workqueue_struct *xfs_ail_wq; /* AIL workqueue */
-STATIC void xfs_ail_splice(struct xfs_ail *, struct list_head *, xfs_lsn_t);
-STATIC void xfs_ail_delete(struct xfs_ail *, xfs_log_item_t *);
-STATIC xfs_log_item_t * xfs_ail_min(struct xfs_ail *);
-STATIC xfs_log_item_t * xfs_ail_next(struct xfs_ail *, xfs_log_item_t *);
-
#ifdef DEBUG
STATIC void xfs_ail_check(struct xfs_ail *, xfs_log_item_t *);
#else
#define xfs_ail_check(a,l)
#endif /* DEBUG */
+/*
+ * Return a pointer to the first item in the AIL.
+ * If the AIL is empty, then return NULL.
+ */
+static xfs_log_item_t *
+xfs_ail_min(
+ struct xfs_ail *ailp)
+{
+ if (list_empty(&ailp->xa_ail))
+ return NULL;
+
+ return list_first_entry(&ailp->xa_ail, xfs_log_item_t, li_ail);
+}
+
+/*
+ * Return a pointer to the last item in the AIL.
+ * If the AIL is empty, then return NULL.
+ */
+static xfs_log_item_t *
+xfs_ail_max(
+ struct xfs_ail *ailp)
+{
+ if (list_empty(&ailp->xa_ail))
+ return NULL;
+
+ return list_entry(ailp->xa_ail.prev, xfs_log_item_t, li_ail);
+}
+
+/*
+ * Return a pointer to the item which follows
+ * the given item in the AIL. If the given item
+ * is the last item in the list, then return NULL.
+ */
+static xfs_log_item_t *
+xfs_ail_next(
+ struct xfs_ail *ailp,
+ xfs_log_item_t *lip)
+{
+ if (lip->li_ail.next == &ailp->xa_ail)
+ return NULL;
+
+ return list_first_entry(&lip->li_ail, xfs_log_item_t, li_ail);
+}
/*
* This is called by the log manager code to determine the LSN
@@ -55,16 +93,32 @@ xfs_lsn_t
xfs_trans_ail_tail(
struct xfs_ail *ailp)
{
- xfs_lsn_t lsn;
+ xfs_lsn_t lsn = 0;
xfs_log_item_t *lip;
spin_lock(&ailp->xa_lock);
lip = xfs_ail_min(ailp);
- if (lip == NULL) {
- lsn = (xfs_lsn_t)0;
- } else {
+ if (lip)
+ lsn = lip->li_lsn;
+ spin_unlock(&ailp->xa_lock);
+
+ return lsn;
+}
+
+/*
+ * Return the maximum lsn held in the AIL, or zero if the AIL is empty.
+ */
+static xfs_lsn_t
+xfs_ail_max_lsn(
+ struct xfs_ail *ailp)
+{
+ xfs_lsn_t lsn = 0;
+ xfs_log_item_t *lip;
+
+ spin_lock(&ailp->xa_lock);
+ lip = xfs_ail_max(ailp);
+ if (lip)
lsn = lip->li_lsn;
- }
spin_unlock(&ailp->xa_lock);
return lsn;
@@ -208,6 +262,50 @@ out:
}
/*
+ * splice the log item list into the AIL at the given LSN.
+ */
+static void
+xfs_ail_splice(
+ struct xfs_ail *ailp,
+ struct list_head *list,
+ xfs_lsn_t lsn)
+{
+ xfs_log_item_t *next_lip;
+
+ /*
+ * If the list is empty, just insert the item.
+ */
+ if (list_empty(&ailp->xa_ail)) {
+ list_splice(list, &ailp->xa_ail);
+ return;
+ }
+
+ list_for_each_entry_reverse(next_lip, &ailp->xa_ail, li_ail) {
+ if (XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0)
+ break;
+ }
+
+ ASSERT((&next_lip->li_ail == &ailp->xa_ail) ||
+ (XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0));
+
+ list_splice_init(list, &next_lip->li_ail);
+ return;
+}
+
+/*
+ * Delete the given item from the AIL.
+ */
+static void
+xfs_ail_delete(
+ struct xfs_ail *ailp,
+ xfs_log_item_t *lip)
+{
+ xfs_ail_check(ailp, lip);
+ list_del(&lip->li_ail);
+ xfs_trans_ail_cursor_clear(ailp, lip);
+}
+
+/*
* xfs_ail_worker does the work of pushing on the AIL. It will requeue itself
* to run at a later time if there is more work to do to complete the push.
*/
@@ -431,6 +529,20 @@ xfs_trans_ail_push(
}
/*
+ * Similar to xfs_trans_push_ail(), except we are trying to push out all
+ * items in the AIL immediately.
+ */
+void
+xfs_trans_ail_push_all(
+ struct xfs_ail *ailp)
+{
+ xfs_lsn_t threshold_lsn = xfs_ail_max_lsn(ailp);
+
+ if (threshold_lsn && !XFS_FORCED_SHUTDOWN(ailp->xa_mount))
+ xfs_ail_push_queue(ailp, threshold_lsn, 0);
+}
+
+/*
* This is to be called when an item is unlocked that may have
* been in the AIL. It will wake up the first member of the AIL
* wait list if this item's unlocking might allow it to progress.
@@ -663,80 +775,6 @@ xfs_trans_ail_destroy(
kmem_free(ailp);
}
-/*
- * splice the log item list into the AIL at the given LSN.
- */
-STATIC void
-xfs_ail_splice(
- struct xfs_ail *ailp,
- struct list_head *list,
- xfs_lsn_t lsn)
-{
- xfs_log_item_t *next_lip;
-
- /*
- * If the list is empty, just insert the item.
- */
- if (list_empty(&ailp->xa_ail)) {
- list_splice(list, &ailp->xa_ail);
- return;
- }
-
- list_for_each_entry_reverse(next_lip, &ailp->xa_ail, li_ail) {
- if (XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0)
- break;
- }
-
- ASSERT((&next_lip->li_ail == &ailp->xa_ail) ||
- (XFS_LSN_CMP(next_lip->li_lsn, lsn) <= 0));
-
- list_splice_init(list, &next_lip->li_ail);
- return;
-}
-
-/*
- * Delete the given item from the AIL. Return a pointer to the item.
- */
-STATIC void
-xfs_ail_delete(
- struct xfs_ail *ailp,
- xfs_log_item_t *lip)
-{
- xfs_ail_check(ailp, lip);
- list_del(&lip->li_ail);
- xfs_trans_ail_cursor_clear(ailp, lip);
-}
-
-/*
- * Return a pointer to the first item in the AIL.
- * If the AIL is empty, then return NULL.
- */
-STATIC xfs_log_item_t *
-xfs_ail_min(
- struct xfs_ail *ailp)
-{
- if (list_empty(&ailp->xa_ail))
- return NULL;
-
- return list_first_entry(&ailp->xa_ail, xfs_log_item_t, li_ail);
-}
-
-/*
- * Return a pointer to the item which follows
- * the given item in the AIL. If the given item
- * is the last item in the list, then return NULL.
- */
-STATIC xfs_log_item_t *
-xfs_ail_next(
- struct xfs_ail *ailp,
- xfs_log_item_t *lip)
-{
- if (lip->li_ail.next == &ailp->xa_ail)
- return NULL;
-
- return list_first_entry(&lip->li_ail, xfs_log_item_t, li_ail);
-}
-
#ifdef DEBUG
/*
* Check that the list is sorted as it should be.
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 2410da4..dbd93e6 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -103,6 +103,7 @@ xfs_trans_ail_delete(
}
void xfs_trans_ail_push(struct xfs_ail *, xfs_lsn_t);
+void xfs_trans_ail_push_all(struct xfs_ail *);
void xfs_trans_unlocked_item(struct xfs_ail *,
xfs_log_item_t *);
--
1.7.2.3
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 22+ messages in thread

* Re: [PATCH 6/6] xfs: push the AIL from memory reclaim and periodic sync
2011-03-10 0:05 ` [PATCH 6/6] xfs: push the AIL from memory reclaim and periodic sync Dave Chinner
@ 2011-03-10 21:31 ` Christoph Hellwig
2011-03-18 4:07 ` Dave Chinner
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2011-03-10 21:31 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
> -STATIC void xfs_ail_splice(struct xfs_ail *, struct list_head *, xfs_lsn_t);
> -STATIC void xfs_ail_delete(struct xfs_ail *, xfs_log_item_t *);
> -STATIC xfs_log_item_t * xfs_ail_min(struct xfs_ail *);
> -STATIC xfs_log_item_t * xfs_ail_next(struct xfs_ail *, xfs_log_item_t *);
> -
Reordering and cleanup of unrelated existing functions should be in a
separate patch.
> @@ -55,16 +93,32 @@ xfs_lsn_t
> xfs_trans_ail_tail(
> struct xfs_ail *ailp)
> {
> - xfs_lsn_t lsn;
> + xfs_lsn_t lsn = 0;
> xfs_log_item_t *lip;
>
> spin_lock(&ailp->xa_lock);
> lip = xfs_ail_min(ailp);
> - if (lip == NULL) {
> - lsn = (xfs_lsn_t)0;
> - } else {
> + if (lip)
> + lsn = lip->li_lsn;
> + spin_unlock(&ailp->xa_lock);
> +
> + return lsn;
> +}
> +
> +/*
> + * Return the maximum lsn held in the AIL, or zero if the AIL is empty.
> + */
> +static xfs_lsn_t
> +xfs_ail_max_lsn(
> + struct xfs_ail *ailp)
> +{
> + xfs_lsn_t lsn = 0;
> + xfs_log_item_t *lip;
> +
> + spin_lock(&ailp->xa_lock);
> + lip = xfs_ail_max(ailp);
> + if (lip)
As this is the counterpart to xfs_trans_ail_tail, the naming for both
should be similar. I much prefer the descriptive _lsn naming over the
random trans in xfs_trans_ail_tail.
^ permalink raw reply [flat|nested] 22+ messages in thread

* Re: [PATCH 6/6] xfs: push the AIL from memory reclaim and periodic sync
2011-03-10 21:31 ` Christoph Hellwig
@ 2011-03-18 4:07 ` Dave Chinner
0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2011-03-18 4:07 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Thu, Mar 10, 2011 at 04:31:44PM -0500, Christoph Hellwig wrote:
> > -STATIC void xfs_ail_splice(struct xfs_ail *, struct list_head *, xfs_lsn_t);
> > -STATIC void xfs_ail_delete(struct xfs_ail *, xfs_log_item_t *);
> > -STATIC xfs_log_item_t * xfs_ail_min(struct xfs_ail *);
> > -STATIC xfs_log_item_t * xfs_ail_next(struct xfs_ail *, xfs_log_item_t *);
> > -
>
> Reordering and cleanup of unrelated existing functions should be in a
> separate patch.
OK.
>
> > @@ -55,16 +93,32 @@ xfs_lsn_t
> > xfs_trans_ail_tail(
> > struct xfs_ail *ailp)
> > {
> > - xfs_lsn_t lsn;
> > + xfs_lsn_t lsn = 0;
> > xfs_log_item_t *lip;
> >
> > spin_lock(&ailp->xa_lock);
> > lip = xfs_ail_min(ailp);
> > - if (lip == NULL) {
> > - lsn = (xfs_lsn_t)0;
> > - } else {
> > + if (lip)
> > + lsn = lip->li_lsn;
> > + spin_unlock(&ailp->xa_lock);
> > +
> > + return lsn;
> > +}
> > +
> > +/*
> > + * Return the maximum lsn held in the AIL, or zero if the AIL is empty.
> > + */
> > +static xfs_lsn_t
> > +xfs_ail_max_lsn(
> > + struct xfs_ail *ailp)
> > +{
> > + xfs_lsn_t lsn = 0;
> > + xfs_log_item_t *lip;
> > +
> > + spin_lock(&ailp->xa_lock);
> > + lip = xfs_ail_max(ailp);
> > + if (lip)
>
> As this is the counterpart to xfs_trans_ail_tail the naming for both
> should be similar. I much prefer the descriptive _lsn naming over the
> random trans in xfs_trans_ail_tail.
Ok, will change.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 22+ messages in thread