public inbox for linux-xfs@vger.kernel.org
* [PATCH] [RFC] xfs: introduce an allocation workqueue
@ 2011-04-12 13:52 Dave Chinner
  2011-04-14 19:29 ` Alex Elder
From: Dave Chinner @ 2011-04-12 13:52 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

We currently have significant issues with the amount of stack that
allocation in XFS uses, especially in the writeback path. We can
easily consume 4k of stack between mapping the page, manipulating
the bmap btree and allocating blocks from the free list. Not to
mention btree block readahead and other functionality that issues IO
in the allocation path.

As a result, we can no longer fit allocation in the writeback path
in the stack space provided on x86_64. To alleviate this problem,
introduce an allocation workqueue and move all allocations to a
separate context. This can be easily added as an interposing layer
into xfs_alloc_vextent(), which takes a single argument structure
and does not return until the allocation is complete or has failed.

To do this, add a work structure and a completion to the allocation
args structure. This allows xfs_alloc_vextent to queue the args onto
the workqueue and wait for it to be completed by the worker. This
can be done completely transparently to the caller.

The worker function needs to ensure that it sets and clears the
PF_FSTRANS flag appropriately as it is being run in an active
transaction context. Work can also be queued in a memory reclaim
context, so a rescuer is needed for the workqueue.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_super.c |   13 +++++++++++++
 fs/xfs/xfs_alloc.c           |   33 ++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_alloc.h           |    5 +++++
 3 files changed, 50 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index b38e58d..8669dd2 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1733,8 +1733,21 @@ xfs_init_workqueues(void)
 	if (!xfs_ail_wq)
 		goto out_destroy_syncd;
 
+	/*
+	 * The allocation workqueue can be used in memory reclaim situations
+	 * (writepage path), and parallelism is only limited by the number of
+	 * AGs in all the filesystems mounted. Hence maxactive is set very
+	 * high.
+	 */
+	xfs_alloc_wq = alloc_workqueue("xfsalloc",
+				WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 128);
+	if (!xfs_alloc_wq)
+		goto out_destroy_ail;
+
 	return 0;
 
+out_destroy_ail:
+	destroy_workqueue(xfs_ail_wq);
 out_destroy_syncd:
 	destroy_workqueue(xfs_syncd_wq);
 out:
diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index 8946464..8334b08 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -35,6 +35,7 @@
 #include "xfs_error.h"
 #include "xfs_trace.h"
 
+struct workqueue_struct *xfs_alloc_wq;
 
 #define XFS_ABSDIFF(a,b)	(((a) <= (b)) ? ((b) - (a)) : ((a) - (b)))
 
@@ -2166,7 +2167,7 @@ xfs_alloc_read_agf(
  * group or loop over the allocation groups to find the result.
  */
 int				/* error */
-xfs_alloc_vextent(
+__xfs_alloc_vextent(
 	xfs_alloc_arg_t	*args)	/* allocation argument structure */
 {
 	xfs_agblock_t	agsize;	/* allocation group size */
@@ -2376,6 +2377,36 @@ error0:
 	return error;
 }
 
+static void
+xfs_alloc_vextent_worker(
+	struct work_struct	*work)
+{
+	struct xfs_alloc_arg	*args = container_of(work,
+						struct xfs_alloc_arg, work);
+
+	/* we are in a transaction context here */
+	current_set_flags_nested(&args->tp->t_pflags, PF_FSTRANS);
+
+	args->result = __xfs_alloc_vextent(args);
+	complete(args->done);
+
+	current_restore_flags_nested(&args->tp->t_pflags, PF_FSTRANS);
+}
+
+
+int				/* error */
+xfs_alloc_vextent(
+	xfs_alloc_arg_t	*args)	/* allocation argument structure */
+{
+	DECLARE_COMPLETION_ONSTACK(done);
+
+	args->done = &done;
+	INIT_WORK(&args->work, xfs_alloc_vextent_worker);
+	queue_work(xfs_alloc_wq, &args->work);
+	wait_for_completion(&done);
+	return args->result;
+}
+
 /*
  * Free an extent.
  * Just break up the extent address and hand off to xfs_free_ag_extent
diff --git a/fs/xfs/xfs_alloc.h b/fs/xfs/xfs_alloc.h
index d0b3bc7..4b708c8 100644
--- a/fs/xfs/xfs_alloc.h
+++ b/fs/xfs/xfs_alloc.h
@@ -25,6 +25,8 @@ struct xfs_perag;
 struct xfs_trans;
 struct xfs_busy_extent;
 
+extern struct workqueue_struct *xfs_alloc_wq;
+
 /*
  * Freespace allocation types.  Argument to xfs_alloc_[v]extent.
  */
@@ -119,6 +121,9 @@ typedef struct xfs_alloc_arg {
 	char		isfl;		/* set if is freelist blocks - !acctg */
 	char		userdata;	/* set if this is user data */
 	xfs_fsblock_t	firstblock;	/* io first block allocated */
+	struct completion *done;
+	struct work_struct work;
+	int		result;
 } xfs_alloc_arg_t;
 
 /*
-- 
1.7.2.3

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH] [RFC] xfs: introduce an allocation workqueue
  2011-04-12 13:52 [PATCH] [RFC] xfs: introduce an allocation workqueue Dave Chinner
@ 2011-04-14 19:29 ` Alex Elder
  2011-04-15  1:49   ` Dave Chinner
From: Alex Elder @ 2011-04-14 19:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2011-04-12 at 23:52 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently have significant issues with the amount of stack that
> allocation in XFS uses, especially in the writeback path. We can
> easily consume 4k of stack between mapping the page, manipulating
> the bmap btree and allocating blocks from the free list. Not to
> mention btree block readahead and other functionality that issues IO
> in the allocation path.
> 
> As a result, we can no longer fit allocation in the writeback path
> in the stack space provided on x86_64. To alleviate this problem,
> introduce an allocation workqueue and move all allocations to a
> separate context. This can be easily added as an interposing layer
> into xfs_alloc_vextent(), which takes a single argument structure
> and does not return until the allocation is complete or has failed.
> 
> To do this, add a work structure and a completion to the allocation
> args structure. This allows xfs_alloc_vextent to queue the args onto
> the workqueue and wait for it to be completed by the worker. This
> can be done completely transparently to the caller.
> 
> The worker function needs to ensure that it sets and clears the
> PF_FSTRANS flag appropriately as it is being run in an active
> transaction context. Work can also be queued in a memory reclaim
> context, so a rescuer is needed for the workqueue.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Interesting.

I guess I don't see anything inherently wrong with this.
At first glance it seems like workqueue abuse.  But it's
better than rolling our own daemon to do the same thing.
(There's nothing to stop you from doing this generically
either...)

Will it shift some accounting of CPU time from user to
system?

The code looks OK to me. The idea of it makes me pause
a little, though I don't object.

					-Alex



* Re: [PATCH] [RFC] xfs: introduce an allocation workqueue
  2011-04-14 19:29 ` Alex Elder
@ 2011-04-15  1:49   ` Dave Chinner
  2011-06-16 17:14     ` Alex Elder
From: Dave Chinner @ 2011-04-15  1:49 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Thu, Apr 14, 2011 at 02:29:34PM -0500, Alex Elder wrote:
> On Tue, 2011-04-12 at 23:52 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > We currently have significant issues with the amount of stack that
> > allocation in XFS uses, especially in the writeback path. We can
> > easily consume 4k of stack between mapping the page, manipulating
> > the bmap btree and allocating blocks from the free list. Not to
> > mention btree block readahead and other functionality that issues IO
> > in the allocation path.
> > 
> > As a result, we can no longer fit allocation in the writeback path
> > in the stack space provided on x86_64. To alleviate this problem,
> > introduce an allocation workqueue and move all allocations to a
> > separate context. This can be easily added as an interposing layer
> > into xfs_alloc_vextent(), which takes a single argument structure
> > and does not return until the allocation is complete or has failed.
> > 
> > To do this, add a work structure and a completion to the allocation
> > args structure. This allows xfs_alloc_vextent to queue the args onto
> > the workqueue and wait for it to be completed by the worker. This
> > can be done completely transparently to the caller.
> > 
> > The worker function needs to ensure that it sets and clears the
> > PF_FSTRANS flag appropriately as it is being run in an active
> > transaction context. Work can also be queued in a memory reclaim
> > context, so a rescuer is needed for the workqueue.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Interesting.
> 
> I guess I don't see anything inherently wrong with this.
> At first glance it seems like workqueue abuse.  But it's
> better than rolling our own daemon to do the same thing.
> (There's nothing to stop you from doing this generically
> either...)
> 
> Will it shift some accounting of CPU time from user to
> system?

Some, yes. For delayed allocation in the background flush or sync
path, there is no difference because that isn't accounted to the
user anyway. The only cases it will make a difference are for
foreground writeback (e.g. fsync or write throttling), and for inode
and directory allocation.

> The code looks OK to me. The idea of it makes me pause
> a little, though I don't object.

It's definitely a different way of thinking, but I'm seriously
starting to consider pushing more operations into asynchronous
background tasks that the caller can wait on if they so desire.
e.g. EOF truncation during xfs_inactive and xfs_release(), flushing
a range of a file, inode cluster freeing, CIL flushes, etc.

We've got multiple cores in most machines these days (even the low
end is going multi-core), context switch overhead is pretty low and
we've got concurrency managed work queues so we don't need thread
pools - we should take advantage of what they provide....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [PATCH] [RFC] xfs: introduce an allocation workqueue
  2011-04-15  1:49   ` Dave Chinner
@ 2011-06-16 17:14     ` Alex Elder
From: Alex Elder @ 2011-06-16 17:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs@oss.sgi.com

On Thu, 2011-04-14 at 20:49 -0500, Dave Chinner wrote:
> On Thu, Apr 14, 2011 at 02:29:34PM -0500, Alex Elder wrote:
> > On Tue, 2011-04-12 at 23:52 +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > We currently have significant issues with the amount of stack that
> > > allocation in XFS uses, especially in the writeback path. We can
> > > easily consume 4k of stack between mapping the page, manipulating
> > > the bmap btree and allocating blocks from the free list. Not to
> > > mention btree block readahead and other functionality that issues IO
> > > in the allocation path.
> > > 
> > > As a result, we can no longer fit allocation in the writeback path
> > > in the stack space provided on x86_64. To alleviate this problem,
> > > introduce an allocation workqueue and move all allocations to a
> > > separate context. This can be easily added as an interposing layer
> > > into xfs_alloc_vextent(), which takes a single argument structure
> > > and does not return until the allocation is complete or has failed.
> > > 
> > > To do this, add a work structure and a completion to the allocation
> > > args structure. This allows xfs_alloc_vextent to queue the args onto
> > > the workqueue and wait for it to be completed by the worker. This
> > > can be done completely transparently to the caller.
> > > 
> > > The worker function needs to ensure that it sets and clears the
> > > PF_FSTRANS flag appropriately as it is being run in an active
> > > transaction context. Work can also be queued in a memory reclaim
> > > context, so a rescuer is needed for the workqueue.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > 
> > Interesting.
> > 
> > I guess I don't see anything inherently wrong with this.
> > At first glance it seems like workqueue abuse.  But it's
> > better than rolling our own daemon to do the same thing.
> > (There's nothing to stop you from doing this generically
> > either...)
> > 
> > Will it shift some accounting of CPU time from user to
> > system?
> 
> Some, yes. For delayed allocation in the background flush or sync
> path, there is no difference because that isn't accounted to the
> user anyway. The only cases it will make a difference are for
> foreground writeback (e.g. fsync or write throttling), and for inode
> and directory allocation.

Linus made a comment relevant to this yesterday,
but in reference to RCU making use of kernel threads:

        Re: REGRESSION: Performance regressions from switching
        anon_vma->lock to mutex
        
        So using anonymous kernel threads is actually a real downside.
        It makes it much less obvious what is going on. We saw that
        exact same thing with the generic worker thread conversions:
        things that used to have clear performance issues ("oh, the
        iwl-phy0 thread is using 3% of CPU time because it is polling
        for IO, and I can see that in 'top'") turned into
        much-harder-to-see issues ("oh, kwork0 is using 3% CPU time
        according to 'top' - I have no idea why").

Maybe if there were a way to reuse the generic
worker infrastructure but have it associated with
a kernel module or something.  Then again that is
sort of contrary to the purpose of the generic
centrally-managed pool.  Maybe if generic work
could somehow get tagged with its real owner
or origin it would be better, but I really have
no idea how one would go about that.

In any case, I think we need to be cautious about
moving in this direction.

					-Alex


> > The code looks OK to me. The idea of it makes me pause
> > a little, though I don't object.
> 
> It's definitely a different way of thinking, but I'm seriously
> starting to consider pushing more operations into asynchronous
> background tasks that the caller can wait on if they so desire.
> e.g. EOF truncation during xfs_inactive and xfs_release(), flushing
> a range of a file, inode cluster freeing, CIL flushes, etc.
> 
> We've got multiple cores in most machines these days (even the low
> end is going multi-core), context switch overhead is pretty low and
> we've got concurrency managed work queues so we don't need thread
> pools - we should take advantage of what they provide....
> 
> Cheers,
> 
> Dave.




