* [PATCH 0/6] Lazy workqueues
@ 2009-08-20 10:19 Jens Axboe
2009-08-20 10:19 ` [PATCH 1/6] workqueue: replace singlethread/freezable/rt parameters and variables with flags Jens Axboe
` (8 more replies)
0 siblings, 9 replies; 30+ messages in thread
From: Jens Axboe @ 2009-08-20 10:19 UTC (permalink / raw)
To: linux-kernel; +Cc: jeff, benh, htejun, bzolnier, alan
(sorry for the resend, but apparently the directory had some patches
in it already. plus, stupid git send-email doesn't default to
no chain replies, really annoying)
Hi,
After yesterday's rant on having too many kernel threads and checking
how many I actually have running on this system (531!), I decided to
try and do something about it.
My goal was to retain the workqueue interface instead of coming up with
a new scheme that required conversion (or converting to slow_work which,
btw, is an awful name :-). I also wanted to retain the affinity
guarantees of workqueues as much as possible.
So this is a first step in that direction, it's probably full of races
and holes, but should get the idea across. It adds a
create_lazy_workqueue() helper, similar to the other variants that we
currently have. A lazy workqueue works like a normal workqueue, except
that it only (by default) starts a core thread instead of threads for
all online CPUs. When work is queued on a lazy workqueue for a CPU
that doesn't have a thread running, it will be placed on the core CPU's
list, and the core thread will then create the target thread and move
the work over to it.
Should task creation fail, the queued work will be executed on the
core CPU instead. Once a lazy workqueue thread has been idle for a
certain amount of time, it will again exit.
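
As an illustration of the interface (a sketch against this patchset,
not part of it; the "demo" queue name and work function are made up):

	static struct workqueue_struct *demo_wq;

	static void demo_fn(struct work_struct *work)
	{
		/* runs on the CPU the work was queued from, once the
		 * lazily created thread for that CPU exists */
	}

	static DECLARE_WORK(demo_work, demo_fn);

	static int __init demo_init(void)
	{
		/* only the core thread is started here; per-CPU
		 * threads appear on demand and exit again when idle */
		demo_wq = create_lazy_workqueue("demo");
		if (!demo_wq)
			return -ENOMEM;

		queue_work(demo_wq, &demo_work);
		return 0;
	}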
The patch boots here and I exercised the rpciod workqueue and
verified that it gets created, runs on the right CPU, and exits a while
later. So core functionality should be there, even if it has holes.
With this patchset, I am now down to 280 kernel threads on one of my test
boxes. Still too many, but it's a start and a net reduction of 251
threads here, or 47%!
The code can also be pulled from:
git://git.kernel.dk/linux-2.6-block.git workqueue
--
Jens Axboe

* [PATCH 1/6] workqueue: replace singlethread/freezable/rt parameters and variables with flags
  2009-08-20 10:19 ` Jens Axboe

From: Jens Axboe @ 2009-08-20 10:19 UTC (permalink / raw)
To: linux-kernel; +Cc: jeff, benh, htejun, bzolnier, alan, Jens Axboe

Collapse the three ints into a flags variable, in preparation for
adding another flag.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 include/linux/workqueue.h |   32 ++++++++++++++++++--------------
 kernel/workqueue.c        |   22 ++++++++--------------
 2 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 13e1adf..f14e20e 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -165,12 +165,17 @@ struct execute_work {

 extern struct workqueue_struct *
-__create_workqueue_key(const char *name, int singlethread,
-		       int freezeable, int rt, struct lock_class_key *key,
-		       const char *lock_name);
+__create_workqueue_key(const char *name, unsigned int flags,
+		       struct lock_class_key *key, const char *lock_name);
+
+enum {
+	WQ_F_SINGLETHREAD	= 1,
+	WQ_F_FREEZABLE		= 2,
+	WQ_F_RT			= 4,
+};

 #ifdef CONFIG_LOCKDEP
-#define __create_workqueue(name, singlethread, freezeable, rt)	\
+#define __create_workqueue(name, flags)				\
 ({								\
	static struct lock_class_key __key;			\
	const char *__lock_name;				\
								\
@@ -180,20 +185,19 @@ __create_workqueue_key(const char *name, int singlethread,
	else						\
		__lock_name = #name;			\
							\
-	__create_workqueue_key((name), (singlethread),	\
-			       (freezeable), (rt), &__key,	\
-			       __lock_name);		\
+	__create_workqueue_key((name), (flags), &__key, __lock_name); \
 })
 #else
-#define __create_workqueue(name, singlethread, freezeable, rt)	\
-	__create_workqueue_key((name), (singlethread), (freezeable), (rt), \
-			       NULL, NULL)
+#define __create_workqueue(name, flags)	\
+	__create_workqueue_key((name), (flags), NULL, NULL)
 #endif

-#define create_workqueue(name) __create_workqueue((name), 0, 0, 0)
-#define create_rt_workqueue(name) __create_workqueue((name), 0, 0, 1)
-#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0)
-#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0)
+#define create_workqueue(name) __create_workqueue((name), 0)
+#define create_rt_workqueue(name) __create_workqueue((name), WQ_F_RT)
+#define create_freezeable_workqueue(name)	\
+	__create_workqueue((name), WQ_F_SINGLETHREAD | WQ_F_FREEZABLE)
+#define create_singlethread_workqueue(name)	\
+	__create_workqueue((name), WQ_F_SINGLETHREAD)

 extern void destroy_workqueue(struct workqueue_struct *wq);

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 0668795..02ba7c9 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -60,9 +60,7 @@ struct workqueue_struct {
	struct cpu_workqueue_struct *cpu_wq;
	struct list_head list;
	const char *name;
-	int singlethread;
-	int freezeable;		/* Freeze threads during suspend */
-	int rt;
+	unsigned int flags;	/* WQ_F_* flags */
 #ifdef CONFIG_LOCKDEP
	struct lockdep_map lockdep_map;
 #endif
@@ -84,9 +82,9 @@ static const struct cpumask *cpu_singlethread_map __read_mostly;
 static cpumask_var_t cpu_populated_map __read_mostly;

 /* If it's single threaded, it isn't in the list of workqueues. */
-static inline int is_wq_single_threaded(struct workqueue_struct *wq)
+static inline bool is_wq_single_threaded(struct workqueue_struct *wq)
 {
-	return wq->singlethread;
+	return wq->flags & WQ_F_SINGLETHREAD;
 }

 static const struct cpumask *wq_cpu_map(struct workqueue_struct *wq)
@@ -314,7 +312,7 @@ static int worker_thread(void *__cwq)
	struct cpu_workqueue_struct *cwq = __cwq;
	DEFINE_WAIT(wait);

-	if (cwq->wq->freezeable)
+	if (cwq->wq->flags & WQ_F_FREEZABLE)
		set_freezable();

	set_user_nice(current, -5);
@@ -768,7 +766,7 @@ static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
	 */
	if (IS_ERR(p))
		return PTR_ERR(p);
-	if (cwq->wq->rt)
+	if (cwq->wq->flags & WQ_F_RT)
		sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
	cwq->thread = p;

@@ -789,9 +787,7 @@ static void start_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
 }

 struct workqueue_struct *__create_workqueue_key(const char *name,
-						int singlethread,
-						int freezeable,
-						int rt,
+						unsigned int flags,
						struct lock_class_key *key,
						const char *lock_name)
 {
@@ -811,12 +807,10 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
	wq->name = name;
	lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
-	wq->singlethread = singlethread;
-	wq->freezeable = freezeable;
-	wq->rt = rt;
+	wq->flags = flags;
	INIT_LIST_HEAD(&wq->list);

-	if (singlethread) {
+	if (flags & WQ_F_SINGLETHREAD) {
		cwq = init_cpu_workqueue(wq, singlethread_cpu);
		err = create_workqueue_thread(cwq, singlethread_cpu);
		start_workqueue_thread(cwq, -1);
--
1.6.4.173.g3f189
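
For reference, a sketch (not part of the patch) of how the old
four-argument calls map onto the new flags; the "demo" name is made up:

	/* pre-patch: __create_workqueue("demo", 1, 1, 0) */
	struct workqueue_struct *wq =
		__create_workqueue("demo", WQ_F_SINGLETHREAD | WQ_F_FREEZABLE);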
* [PATCH 2/6] workqueue: add support for lazy workqueues
  2009-08-20 10:20 ` Jens Axboe

From: Jens Axboe @ 2009-08-20 10:20 UTC (permalink / raw)
To: linux-kernel; +Cc: jeff, benh, htejun, bzolnier, alan, Jens Axboe

Lazy workqueues are like normal workqueues, except they don't
start a thread per CPU by default. Instead threads are started
when they are needed, and exit when they have been idle for
some time.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 include/linux/workqueue.h |    5 ++
 kernel/workqueue.c        |  152 ++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 147 insertions(+), 10 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index f14e20e..b2dd267 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -32,6 +32,7 @@ struct work_struct {
 #ifdef CONFIG_LOCKDEP
	struct lockdep_map lockdep_map;
 #endif
+	unsigned int cpu;
 };

 #define WORK_DATA_INIT()	ATOMIC_LONG_INIT(0)
@@ -172,6 +173,7 @@ enum {
	WQ_F_SINGLETHREAD	= 1,
	WQ_F_FREEZABLE		= 2,
	WQ_F_RT			= 4,
+	WQ_F_LAZY		= 8,
 };

 #ifdef CONFIG_LOCKDEP
@@ -198,6 +200,7 @@ enum {
	__create_workqueue((name), WQ_F_SINGLETHREAD | WQ_F_FREEZABLE)
 #define create_singlethread_workqueue(name)	\
	__create_workqueue((name), WQ_F_SINGLETHREAD)
+#define create_lazy_workqueue(name) __create_workqueue((name), WQ_F_LAZY)

 extern void destroy_workqueue(struct workqueue_struct *wq);

@@ -211,6 +214,8 @@ extern int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,

 extern void flush_workqueue(struct workqueue_struct *wq);
 extern void flush_scheduled_work(void);
+extern void workqueue_set_lazy_timeout(struct workqueue_struct *wq,
+					unsigned long timeout);

 extern int schedule_work(struct work_struct *work);
 extern int schedule_work_on(int cpu, struct work_struct *work);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 02ba7c9..d9ccebc 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -61,11 +61,17 @@ struct workqueue_struct {
	struct list_head list;
	const char *name;
	unsigned int flags;	/* WQ_F_* flags */
+	unsigned long lazy_timeout;
+	unsigned int core_cpu;
 #ifdef CONFIG_LOCKDEP
	struct lockdep_map lockdep_map;
 #endif
 };

+/* Default lazy workqueue timeout */
+#define WQ_DEF_LAZY_TIMEOUT	(60 * HZ)
+
+
 /* Serializes the accesses to the list of workqueues. */
 static DEFINE_SPINLOCK(workqueue_lock);
 static LIST_HEAD(workqueues);
@@ -81,6 +87,8 @@ static const struct cpumask *cpu_singlethread_map __read_mostly;
  */
 static cpumask_var_t cpu_populated_map __read_mostly;

+static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu);
+
 /* If it's single threaded, it isn't in the list of workqueues. */
 static inline bool is_wq_single_threaded(struct workqueue_struct *wq)
 {
@@ -141,11 +149,29 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
 static void __queue_work(struct cpu_workqueue_struct *cwq,
			 struct work_struct *work)
 {
+	struct workqueue_struct *wq = cwq->wq;
	unsigned long flags;

-	spin_lock_irqsave(&cwq->lock, flags);
-	insert_work(cwq, work, &cwq->worklist);
-	spin_unlock_irqrestore(&cwq->lock, flags);
+	/*
+	 * This is a lazy workqueue and this particular CPU thread has
+	 * exited. We can't create it from here, so add this work on our
+	 * static thread. It will create this thread and move the work there.
+	 */
+	if ((wq->flags & WQ_F_LAZY) && !cwq->thread) {
+		struct cpu_workqueue_struct *__cwq;
+
+		local_irq_save(flags);
+		__cwq = wq_per_cpu(wq, wq->core_cpu);
+		work->cpu = smp_processor_id();
+		spin_lock(&__cwq->lock);
+		insert_work(__cwq, work, &__cwq->worklist);
+		spin_unlock_irqrestore(&__cwq->lock, flags);
+	} else {
+		spin_lock_irqsave(&cwq->lock, flags);
+		work->cpu = smp_processor_id();
+		insert_work(cwq, work, &cwq->worklist);
+		spin_unlock_irqrestore(&cwq->lock, flags);
+	}
 }

 /**
@@ -259,13 +285,16 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 }
 EXPORT_SYMBOL_GPL(queue_delayed_work_on);

-static void run_workqueue(struct cpu_workqueue_struct *cwq)
+static int run_workqueue(struct cpu_workqueue_struct *cwq)
 {
+	int did_work = 0;
+
	spin_lock_irq(&cwq->lock);
	while (!list_empty(&cwq->worklist)) {
		struct work_struct *work = list_entry(cwq->worklist.next,
						struct work_struct, entry);
		work_func_t f = work->func;
+		int cpu;
 #ifdef CONFIG_LOCKDEP
		/*
		 * It is permissible to free the struct work_struct
@@ -280,7 +309,34 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq)
		trace_workqueue_execution(cwq->thread, work);
		cwq->current_work = work;
		list_del_init(cwq->worklist.next);
+		cpu = smp_processor_id();
		spin_unlock_irq(&cwq->lock);
+		did_work = 1;
+
+		/*
+		 * If work->cpu isn't us, then we need to create the target
+		 * workqueue thread (if someone didn't already do that) and
+		 * move the work over there.
+		 */
+		if ((cwq->wq->flags & WQ_F_LAZY) && work->cpu != cpu) {
+			struct cpu_workqueue_struct *__cwq;
+			struct task_struct *p;
+			int err;
+
+			__cwq = wq_per_cpu(cwq->wq, work->cpu);
+			p = __cwq->thread;
+			if (!p)
+				err = create_workqueue_thread(__cwq, work->cpu);
+			p = __cwq->thread;
+			if (p) {
+				if (work->cpu >= 0)
+					kthread_bind(p, work->cpu);
+				insert_work(__cwq, work, &__cwq->worklist);
+				wake_up_process(p);
+				goto out;
+			}
+		}
+
		BUG_ON(get_wq_data(work) != cwq);
		work_clear_pending(work);
@@ -305,24 +361,45 @@
		cwq->current_work = NULL;
	}
	spin_unlock_irq(&cwq->lock);
+out:
+	return did_work;
 }

 static int worker_thread(void *__cwq)
 {
	struct cpu_workqueue_struct *cwq = __cwq;
+	struct workqueue_struct *wq = cwq->wq;
+	unsigned long last_active = jiffies;
	DEFINE_WAIT(wait);
+	int may_exit;

-	if (cwq->wq->flags & WQ_F_FREEZABLE)
+	if (wq->flags & WQ_F_FREEZABLE)
		set_freezable();

	set_user_nice(current, -5);

+	/*
+	 * Allow exit if this isn't our core thread
+	 */
+	if ((wq->flags & WQ_F_LAZY) && smp_processor_id() != wq->core_cpu)
+		may_exit = 1;
+	else
+		may_exit = 0;
+
	for (;;) {
+		int did_work;
+
		prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE);
		if (!freezing(current) &&
		    !kthread_should_stop() &&
-		    list_empty(&cwq->worklist))
-			schedule();
+		    list_empty(&cwq->worklist)) {
+			unsigned long timeout = wq->lazy_timeout;
+
+			if (timeout && may_exit)
+				schedule_timeout(timeout);
+			else
+				schedule();
+		}
		finish_wait(&cwq->more_work, &wait);

		try_to_freeze();
@@ -330,7 +407,19 @@ static int worker_thread(void *__cwq)
		if (kthread_should_stop())
			break;

-		run_workqueue(cwq);
+		did_work = run_workqueue(cwq);
+
+		/*
+		 * If we did no work for the defined timeout period and we are
+		 * allowed to exit, do so.
+		 */
+		if (did_work)
+			last_active = jiffies;
+		else if (time_after(jiffies, last_active + wq->lazy_timeout) &&
+			 may_exit) {
+			cwq->thread = NULL;
+			break;
+		}
	}

	return 0;
@@ -814,7 +903,10 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
		cwq = init_cpu_workqueue(wq, singlethread_cpu);
		err = create_workqueue_thread(cwq, singlethread_cpu);
		start_workqueue_thread(cwq, -1);
+		wq->core_cpu = singlethread_cpu;
	} else {
+		int created = 0;
+
		cpu_maps_update_begin();
		/*
		 * We must place this wq on list even if the code below fails.
@@ -833,10 +925,16 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
		 */
		for_each_possible_cpu(cpu) {
			cwq = init_cpu_workqueue(wq, cpu);
-			if (err || !cpu_online(cpu))
+			if (err || !cpu_online(cpu) ||
+			    (created && (wq->flags & WQ_F_LAZY)))
				continue;
			err = create_workqueue_thread(cwq, cpu);
			start_workqueue_thread(cwq, cpu);
+			if (!err) {
+				if (!created)
+					wq->core_cpu = cpu;
+				created++;
+			}
		}
		cpu_maps_update_done();
	}
@@ -844,7 +942,9 @@ struct workqueue_struct *__create_workqueue_key(const char *name,
	if (err) {
		destroy_workqueue(wq);
		wq = NULL;
-	}
+	} else if (wq->flags & WQ_F_LAZY)
+		workqueue_set_lazy_timeout(wq, WQ_DEF_LAZY_TIMEOUT);
+
	return wq;
 }
 EXPORT_SYMBOL_GPL(__create_workqueue_key);
@@ -877,6 +977,13 @@ static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq)
	cwq->thread = NULL;
 }

+static bool hotplug_should_start_thread(struct workqueue_struct *wq, int cpu)
+{
+	if ((wq->flags & WQ_F_LAZY) && cpu != wq->core_cpu)
+		return 0;
+	return 1;
+}
+
 /**
  * destroy_workqueue - safely terminate a workqueue
  * @wq: target workqueue
@@ -923,6 +1030,8 @@ undo:
		switch (action) {
		case CPU_UP_PREPARE:
+			if (!hotplug_should_start_thread(wq, cpu))
+				break;
			if (!create_workqueue_thread(cwq, cpu))
				break;
			printk(KERN_ERR "workqueue [%s] for %i failed\n",
@@ -932,6 +1041,8 @@ undo:
			goto undo;

		case CPU_ONLINE:
+			if (!hotplug_should_start_thread(wq, cpu))
+				break;
			start_workqueue_thread(cwq, cpu);
			break;

@@ -999,6 +1110,27 @@ long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
 EXPORT_SYMBOL_GPL(work_on_cpu);
 #endif /* CONFIG_SMP */

+/**
+ * workqueue_set_lazy_timeout - set lazy exit timeout
+ * @wq: the associated workqueue_struct
+ * @timeout: timeout in jiffies
+ *
+ * This will set the timeout for a lazy workqueue. If no work has been
+ * processed for @timeout jiffies, then the workqueue is allowed to exit.
+ * It will be dynamically created again when work is queued to it.
+ *
+ * Note that this only works for workqueues created with
+ * create_lazy_workqueue().
+ */
+void workqueue_set_lazy_timeout(struct workqueue_struct *wq,
+				unsigned long timeout)
+{
+	if (WARN_ON(!(wq->flags & WQ_F_LAZY)))
+		return;
+
+	wq->lazy_timeout = timeout;
+}
+
 void __init init_workqueues(void)
 {
	alloc_cpumask_var(&cpu_populated_map, GFP_KERNEL);
--
1.6.4.173.g3f189
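
The lazy exit timeout set up above defaults to WQ_DEF_LAZY_TIMEOUT
(60 * HZ). A sketch of how a caller of this patchset could override it
(the "demo" queue is hypothetical):

	struct workqueue_struct *demo_wq;

	demo_wq = create_lazy_workqueue("demo");
	if (demo_wq)
		/* let idle per-CPU threads exit after 5 seconds instead */
		workqueue_set_lazy_timeout(demo_wq, 5 * HZ);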
* Re: [PATCH 2/6] workqueue: add support for lazy workqueues
  2009-08-20 12:01 ` Frederic Weisbecker

From: Frederic Weisbecker @ 2009-08-20 12:01 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan

On Thu, Aug 20, 2009 at 12:20:00PM +0200, Jens Axboe wrote:
> Lazy workqueues are like normal workqueues, except they don't
> start a thread per CPU by default. Instead threads are started
> when they are needed, and exit when they have been idle for
> some time.
>
> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
>
> [...]
>
> @@ -141,11 +149,29 @@ static void insert_work(struct cpu_workqueue_struct *cwq,
>  static void __queue_work(struct cpu_workqueue_struct *cwq,
> 			 struct work_struct *work)
>  {
> +	struct workqueue_struct *wq = cwq->wq;
> 	unsigned long flags;
>
> -	spin_lock_irqsave(&cwq->lock, flags);
> -	insert_work(cwq, work, &cwq->worklist);
> -	spin_unlock_irqrestore(&cwq->lock, flags);
> +	/*
> +	 * This is a lazy workqueue and this particular CPU thread has
> +	 * exited. We can't create it from here, so add this work on our
> +	 * static thread. It will create this thread and move the work there.
> +	 */
> +	if ((wq->flags & WQ_F_LAZY) && !cwq->thread) {

Isn't this part racy? If a work has just been queued but the thread
hasn't yet had enough time to start before we get there...?

> +		struct cpu_workqueue_struct *__cwq;
> +
> +		local_irq_save(flags);
> +		__cwq = wq_per_cpu(wq, wq->core_cpu);
> +		work->cpu = smp_processor_id();
> +		spin_lock(&__cwq->lock);
> +		insert_work(__cwq, work, &__cwq->worklist);
> +		spin_unlock_irqrestore(&__cwq->lock, flags);
> +	} else {
> +		spin_lock_irqsave(&cwq->lock, flags);
> +		work->cpu = smp_processor_id();
> +		insert_work(cwq, work, &cwq->worklist);
> +		spin_unlock_irqrestore(&cwq->lock, flags);
> +	}
>  }
* Re: [PATCH 2/6] workqueue: add support for lazy workqueues
  2009-08-20 12:10 ` Jens Axboe

From: Jens Axboe @ 2009-08-20 12:10 UTC (permalink / raw)
To: Frederic Weisbecker; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan

On Thu, Aug 20 2009, Frederic Weisbecker wrote:
> On Thu, Aug 20, 2009 at 12:20:00PM +0200, Jens Axboe wrote:
> > [...]
> > +	/*
> > +	 * This is a lazy workqueue and this particular CPU thread has
> > +	 * exited. We can't create it from here, so add this work on our
> > +	 * static thread. It will create this thread and move the work there.
> > +	 */
> > +	if ((wq->flags & WQ_F_LAZY) && !cwq->thread) {
>
> Isn't this part racy? If a work has just been queued but the thread
> hasn't yet had enough time to start before we get there...?

Sure it is, see my initial description about holes and races :-)
Thread re-creation and such need to ensure that one and only one gets
set up, of course. I just didn't want to spend a lot of time making it
air tight in case people had big complaints that mean I have to rewrite
bits of it.

--
Jens Axboe
* [PATCH 3/6] crypto: use lazy workqueues
  2009-08-20 10:20 ` Jens Axboe

From: Jens Axboe @ 2009-08-20 10:20 UTC (permalink / raw)
To: linux-kernel; +Cc: jeff, benh, htejun, bzolnier, alan, Jens Axboe

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 crypto/crypto_wq.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/crypto/crypto_wq.c b/crypto/crypto_wq.c
index fdcf624..88cccee 100644
--- a/crypto/crypto_wq.c
+++ b/crypto/crypto_wq.c
@@ -20,7 +20,7 @@ EXPORT_SYMBOL_GPL(kcrypto_wq);

 static int __init crypto_wq_init(void)
 {
-	kcrypto_wq = create_workqueue("crypto");
+	kcrypto_wq = create_lazy_workqueue("crypto");
	if (unlikely(!kcrypto_wq))
		return -ENOMEM;
	return 0;
--
1.6.4.173.g3f189
* [PATCH 4/6] libata: use lazy workqueues for the pio task
  2009-08-20 10:20 ` Jens Axboe

From: Jens Axboe @ 2009-08-20 10:20 UTC (permalink / raw)
To: linux-kernel; +Cc: jeff, benh, htejun, bzolnier, alan, Jens Axboe

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 drivers/ata/libata-core.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 072ba5e..35f74c9 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -6580,7 +6580,7 @@ static int __init ata_init(void)
 {
	ata_parse_force_param();

-	ata_wq = create_workqueue("ata");
+	ata_wq = create_lazy_workqueue("ata");
	if (!ata_wq)
		goto free_force_tbl;

--
1.6.4.173.g3f189
* Re: [PATCH 4/6] libata: use lazy workqueues for the pio task
  2009-08-20 12:40 ` Stefan Richter

From: Stefan Richter @ 2009-08-20 12:40 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan

Jens Axboe wrote:
> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> ---
>  drivers/ata/libata-core.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 072ba5e..35f74c9 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -6580,7 +6580,7 @@ static int __init ata_init(void)
>  {
> 	ata_parse_force_param();
>
> -	ata_wq = create_workqueue("ata");
> +	ata_wq = create_lazy_workqueue("ata");
> 	if (!ata_wq)
> 		goto free_force_tbl;
>

However, this does not solve the issue of lacking parallelism on UP
machines, does it?

--
Stefan Richter
-=====-==--= =--- =-=--
http://arcgraph.de/sr/
* Re: [PATCH 4/6] libata: use lazy workqueues for the pio task
  2009-08-20 12:48 ` Jens Axboe

From: Jens Axboe @ 2009-08-20 12:48 UTC (permalink / raw)
To: Stefan Richter; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan

On Thu, Aug 20 2009, Stefan Richter wrote:
> Jens Axboe wrote:
> > [...]
> > -	ata_wq = create_workqueue("ata");
> > +	ata_wq = create_lazy_workqueue("ata");
> > 	if (!ata_wq)
> > 		goto free_force_tbl;
>
> However, this does not solve the issue of lacking parallelism on UP
> machines, does it?

No, the next step is needed there, having multiple threads. Pretty
similar to what Frederic described. Note that the current implementation
doesn't really solve that either, since work will be executed on the CPU
it is queued on. So there's no existing guarantee that it works, on UP
or SMP. This implementation doesn't modify that behaviour, it's
identical to the current workqueue implementation in that respect.

--
Jens Axboe
* [PATCH 5/6] aio: use lazy workqueues
  2009-08-20 10:20 ` Jens Axboe

From: Jens Axboe @ 2009-08-20 10:20 UTC (permalink / raw)
To: linux-kernel; +Cc: jeff, benh, htejun, bzolnier, alan, Jens Axboe

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/aio.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index d065b2c..4103b59 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -72,7 +72,7 @@ static int __init aio_setup(void)
	kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
	kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);

-	aio_wq = create_workqueue("aio");
+	aio_wq = create_lazy_workqueue("aio");

	pr_debug("aio_setup: sizeof(struct page) = %d\n", (int)sizeof(struct page));
--
1.6.4.173.g3f189
* Re: [PATCH 5/6] aio: use lazy workqueues
  2009-08-20 15:09 ` Jeff Moyer

From: Jeff Moyer @ 2009-08-20 15:09 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan, zach.brown

Jens Axboe <jens.axboe@oracle.com> writes:

> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> ---
>  fs/aio.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index d065b2c..4103b59 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -72,7 +72,7 @@ static int __init aio_setup(void)
>  	kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
>  	kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
>
> -	aio_wq = create_workqueue("aio");
> +	aio_wq = create_lazy_workqueue("aio");
>
>  	pr_debug("aio_setup: sizeof(struct page) = %d\n", (int)sizeof(struct page));

So far as I can tell, the aio workqueue isn't used for much these days.
We could probably get away with switching to keventd. Zach, isn't
someone working on a patch to get rid of all of the -EIOCBRETRY
infrastructure? That patch would probably make things clearer in this
area.

Cheers,
Jeff
* Re: [PATCH 5/6] aio: use lazy workqueues
  2009-08-21 18:31 ` Zach Brown

From: Zach Brown @ 2009-08-21 18:31 UTC (permalink / raw)
To: Jeff Moyer; +Cc: Jens Axboe, linux-kernel, jeff, benh, htejun, bzolnier, alan

> So far as I can tell, the aio workqueue isn't used for much these days.
> We could probably get away with switching to keventd.

It's only used by drivers/usb/gadget to implement O_DIRECT reads by
DMAing into kmalloc()ed memory and then performing the copy_to_user()
in the retry thread's task context after it has assumed the submitting
task's mm.

> Zach, isn't someone working on a patch to get rid of all of the
> -EIOCBRETRY infrastructure? That patch would probably make things
> clearer in this area.

Yeah, a startling amount of fs/aio.c vanishes if we get rid of
EIOCBRETRY. I'm puttering away at it, but I'll be on holiday next week
so it'll be a while before anything emerges.

- z
* [PATCH 6/6] sunrpc: use lazy workqueues
  2009-08-20 10:20 ` Jens Axboe

From: Jens Axboe @ 2009-08-20 10:20 UTC (permalink / raw)
To: linux-kernel; +Cc: jeff, benh, htejun, bzolnier, alan, Jens Axboe

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 net/sunrpc/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 8f459ab..ce99fe2 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -970,7 +970,7 @@ static int rpciod_start(void)
	 * Create the rpciod thread and wait for it to start.
	 */
	dprintk("RPC: creating workqueue rpciod\n");
-	wq = create_workqueue("rpciod");
+	wq = create_lazy_workqueue("rpciod");
	rpciod_workqueue = wq;
	return rpciod_workqueue != NULL;
 }
--
1.6.4.173.g3f189
* Re: [PATCH 0/6] Lazy workqueues
  2009-08-20 12:04 ` Peter Zijlstra

From: Peter Zijlstra @ 2009-08-20 12:04 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan

On Thu, 2009-08-20 at 12:19 +0200, Jens Axboe wrote:
> (sorry for the resend, but apparently the directory had some patches
> in it already. plus, stupid git send-email doesn't default to
> no chain replies, really annoying)

Newer versions should, I made a stink about this some time ago.
* Re: [PATCH 0/6] Lazy workqueues
  2009-08-20 12:08 ` Jens Axboe

From: Jens Axboe @ 2009-08-20 12:08 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan

On Thu, Aug 20 2009, Peter Zijlstra wrote:
> On Thu, 2009-08-20 at 12:19 +0200, Jens Axboe wrote:
> > (sorry for the resend, but apparently the directory had some patches
> > in it already. plus, stupid git send-email doesn't default to
> > no chain replies, really annoying)
>
> Newer versions should, I made a stink about this some time ago.

git version 1.6.4.173.g3f189

That's pretty new... But perhaps I should complain too, it's been
annoying me forever.

--
Jens Axboe
* Re: [PATCH 0/6] Lazy workqueues
  2009-08-20 12:16 ` Peter Zijlstra

From: Peter Zijlstra @ 2009-08-20 12:16 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan, Junio C Hamano

On Thu, 2009-08-20 at 14:08 +0200, Jens Axboe wrote:
> On Thu, Aug 20 2009, Peter Zijlstra wrote:
> > Newer versions should, I made a stink about this some time ago.
>
> git version 1.6.4.173.g3f189
>
> That's pretty new... But perhaps I should complain too, it's been
> annoying me forever.

http://marc.info/?l=git&m=123457137328461&w=2

Apparently it didn't happen, nor did I ever see a reply to that posting.

Junio, what happened here?
* Re: [PATCH 0/6] Lazy workqueues
  2009-08-23  2:42 ` Junio C Hamano

From: Junio C Hamano @ 2009-08-23 2:42 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Jens Axboe, linux-kernel, jeff, benh, htejun, bzolnier, alan

Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, 2009-08-20 at 14:08 +0200, Jens Axboe wrote:
> ...
> > That's pretty new... But perhaps I should complain too, it's been
> > annoying me forever.
>
> http://marc.info/?l=git&m=123457137328461&w=2
>
> Apparently it didn't happen, nor did I ever see a reply to that posting.
>
> Junio, what happened here?

Nothing happened.

I do not recall anybody objecting to it, but then when nothing happened
in either 1.6.3 or 1.6.4, nobody jumped up-and-down demanding the change
of default either. So the overall impression I got from this was that
nobody really cared deeply enough either way.

But we are talking about 1.7.0 to become a release to correct wrong
defaults we have had once and for all ;-), and I am tempted to roll this
topic into the mix. Here is what I queued to my 'next' branch tonight.

-- >8 --
From: Junio C Hamano <gitster@pobox.com>
Date: Sat, 22 Aug 2009 12:48:48 -0700
Subject: [PATCH] send-email: make --no-chain-reply-to the default

In http://article.gmane.org/gmane.comp.version-control.git/109790 I
threatened to announce a change to the default threading style used by
send-email to no-chain-reply-to (i.e. the second and subsequent messages
will all be replies to the first one), unless nobody objected, in 1.6.3.

Nobody objected, as far as I can dig the list archive. But when nothing
happened in 1.6.3 nor 1.6.4, nobody from the camp that complained loudly
(which led to that message) complained either. So I am guessing that
after all nobody cares about this. But 1.7.0 is a good time to change
this, and as I said in the message, I personally think it is a good
change, so here it is.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/git-send-email.txt |    6 +++---
 git-send-email.perl              |    4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-send-email.txt b/Documentation/git-send-email.txt
index 767cf4d..626c2dc 100644
--- a/Documentation/git-send-email.txt
+++ b/Documentation/git-send-email.txt
@@ -84,7 +84,7 @@ See the CONFIGURATION section for 'sendemail.multiedit'.
 --in-reply-to=<identifier>::
	Specify the contents of the first In-Reply-To header.
	Subsequent emails will refer to the previous email
-	instead of this if --chain-reply-to is set (the default)
+	instead of this if --chain-reply-to is set.
	Only necessary if --compose is also set.  If --compose
	is not set, this will be prompted for.

@@ -171,8 +171,8 @@ Automating
	email sent.  If disabled with "--no-chain-reply-to", all emails after
	the first will be sent as replies to the first email sent.  When using
	this, it is recommended that the first file given be an overview of the
-	entire patch series. Default is the value of the 'sendemail.chainreplyto'
-	configuration value; if that is unspecified, default to --chain-reply-to.
+	entire patch series. Disabled by default, but the 'sendemail.chainreplyto'
+	configuration variable can be used to enable it.

 --identity=<identity>::
	A configuration identity. When given, causes values in the

diff --git a/git-send-email.perl b/git-send-email.perl
index 0700d80..c1d0930 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -71,7 +71,7 @@ git send-email [options] <file | directory | rev-list options >
     --suppress-cc           <str>  * author, self, sob, cc, cccmd, body, bodycc, all.
     --[no-]signed-off-by-cc        * Send to Signed-off-by: addresses. Default on.
     --[no-]suppress-from           * Send to self. Default off.
-    --[no-]chain-reply-to          * Chain In-Reply-To: fields. Default on.
+    --[no-]chain-reply-to          * Chain In-Reply-To: fields. Default off.
     --[no-]thread                  * Use In-Reply-To: field. Default on.

 Administering:
@@ -188,7 +188,7 @@ my (@suppress_cc);

 my %config_bool_settings = (
     "thread" => [\$thread, 1],
-    "chainreplyto" => [\$chain_reply_to, 1],
+    "chainreplyto" => [\$chain_reply_to, undef],
     "suppressfrom" => [\$suppress_from, undef],
     "signedoffbycc" => [\$signed_off_by_cc, undef],
     "signedoffcc" => [\$signed_off_by_cc, undef],	# Deprecated
--
1.6.4.1.255.g5556a
* git send-email defaults
  2009-08-24  7:04 ` Peter Zijlstra

From: Peter Zijlstra @ 2009-08-24 7:04 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jens Axboe, linux-kernel, jeff, benh, htejun, bzolnier, alan

On Sat, 2009-08-22 at 19:42 -0700, Junio C Hamano wrote:
> Nothing happened.
>
> I do not recall anybody objecting to it, but then when nothing happened
> in either 1.6.3 or 1.6.4, nobody jumped up-and-down demanding the change
> of default either. So the overall impression I got from this was that
> nobody really cared deeply enough either way.

And here I was thinking it was settled when no objections came ;-)

> But we are talking about 1.7.0 to become a release to correct wrong
> defaults we have had once and for all ;-), and I am tempted to roll this
> topic into the mix. Here is what I queued to my 'next' branch tonight.

The sooner this hits the distros the better.. Thanks for committing the
change, looking forward to 1.7.
* Re: [PATCH 0/6] Lazy workqueues
  2009-08-24  8:04 ` Jens Axboe

From: Jens Axboe @ 2009-08-24 8:04 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Peter Zijlstra, linux-kernel, jeff, benh, htejun, bzolnier, alan

On Sat, Aug 22 2009, Junio C Hamano wrote:
> Nothing happened.
>
> I do not recall anybody objecting to it, but then when nothing happened
> in either 1.6.3 or 1.6.4, nobody jumped up-and-down demanding the change
> of default either. So the overall impression I got from this was that
> nobody really cared deeply enough either way.

That's some strange logic right there :-). Of course nobody complained,
they thought it was a done deal.

> But we are talking about 1.7.0 to become a release to correct wrong
> defaults we have had once and for all ;-), and I am tempted to roll this
> topic into the mix. Here is what I queued to my 'next' branch tonight.

OK, that's at least something, looking forward to being able to prune
that argument from my scripts. It completely destroys viewability of
larger patchsets.

--
Jens Axboe
* Re: [PATCH 0/6] Lazy workqueues
  2009-08-24  9:03 ` Junio C Hamano

From: Junio C Hamano @ 2009-08-24 9:03 UTC (permalink / raw)
To: Jens Axboe; +Cc: Peter Zijlstra, linux-kernel, jeff, benh, htejun, bzolnier, alan

Jens Axboe <jens.axboe@oracle.com> writes:

> OK, that's at least something, looking forward to being able to prune
> that argument from my scripts.

Ahahh.

An option everybody will want to pass but is prone to be forgotten and
hard to type from the command line is one thing, but if you are scripting
in order to reuse the script over and over, that is a separate story. Is
losing an option from your script really the goal of this fuss?

In any case, you need not wait for a new version nor a patch at all for
that goal. You can simply add

	[sendemail]
		chainreplyto = no

to your .git/config (or $HOME/.gitconfig). Both your script and your
command line invocation will default not to create deep threads with the
setting.
* Re: [PATCH 0/6] Lazy workqueues
  2009-08-24  9:11 ` Peter Zijlstra

From: Peter Zijlstra @ 2009-08-24 9:11 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jens Axboe, linux-kernel, jeff, benh, htejun, bzolnier, alan

On Mon, 2009-08-24 at 02:03 -0700, Junio C Hamano wrote:
> In any case, you need not wait for a new version nor a patch at all for
> that goal. You can simply add
>
> 	[sendemail]
> 		chainreplyto = no
>
> to your .git/config (or $HOME/.gitconfig). Both your script and your
> command line invocation will default not to create deep threads with the
> setting.

For me it's about getting the default right, because lots of people
simply use git-send-email without scripts, and often .gitconfig gets
lost or simply doesn't get carried around the various development
machines.

Also, it stops every new person mailing patches from having to be told
to flip that setting.
* Re: [PATCH 0/6] Lazy workqueues
  2009-08-20 12:22 ` Frederic Weisbecker

From: Frederic Weisbecker @ 2009-08-20 12:22 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan, Andrew Morton, Oleg Nesterov

On Thu, Aug 20, 2009 at 12:19:58PM +0200, Jens Axboe wrote:
> After yesterday's rant on having too many kernel threads and checking
> how many I actually have running on this system (531!), I decided to
> try and do something about it.
>
> [...]
>
> With this patchset, I am now down to 280 kernel threads on one of my
> test boxes. Still too many, but it's a start and a net reduction of
> 251 threads here, or 47%!

That looks like a nice idea that may indeed solve the problem of thread
proliferation with per-cpu workqueues.

Now I think there is another problem that has tainted workqueues from
the beginning, which is the deadlocks induced by one work that waits
for another one in the same workqueue. And since workqueues execute
their jobs serially, the effect is deadlocks.

Often, drivers need to move from the central events/%d to a dedicated
workqueue because of that.

An idea to solve this:

We could have one thread per struct work_struct. Similarly to this
patchset, this thread waits for queuing requests, but only for this
work struct. If the target cpu has no thread for this work, then
create one, like you do, etc...

Then the idea is to have one workqueue per struct work_struct, which
handles per-cpu task creation, etc... And this workqueue only handles
the given work.

That may solve the deadlock scenarios that are often reported and lead
to dedicated workqueue creation.

That also makes the serialization of work execution between different
worklets disappear. We just keep the serialization between instances of
the same work, which seems a pretty natural thing and is less haphazard
than multiple works of different natures randomly serialized between
them.

Note the effect would not only be a reduction of deadlocks but also
probably an increase in throughput, because works of different natures
won't need to wait for the previous one's completion anymore.

Also a reduction in latency (a high-prio work that waits for a
lower-prio work).

There are good chances that we won't need any more per-driver/subsys
workqueue creation after that, because everything would be per worklet.
We could use a single schedule_work() for all of them and not bother
choosing a specific workqueue or the central events/%d.

Hmm?
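
(To make the deadlock class above concrete: a minimal sketch, with
made-up names, against the existing workqueue API. A work that flushes a
second work on the same single-threaded queue can never finish, because
the only worker thread is busy running the first work:

	static struct workqueue_struct *demo_wq; /* single-threaded */
	static struct work_struct work_b;

	static void work_a_fn(struct work_struct *work)
	{
		queue_work(demo_wq, &work_b);
		/* the one worker is running work_a_fn, so work_b never
		 * starts and this wait never returns: deadlock */
		flush_work(&work_b);
	}

The per-worklet threads proposed here would remove exactly this
coupling.)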
* Re: [PATCH 0/6] Lazy workqueues
  2009-08-20 12:41 ` Jens Axboe

From: Jens Axboe @ 2009-08-20 12:41 UTC (permalink / raw)
To: Frederic Weisbecker; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan, Andrew Morton, Oleg Nesterov

On Thu, Aug 20 2009, Frederic Weisbecker wrote:
> [...]
>
> An idea to solve this:
>
> We could have one thread per struct work_struct. Similarly to this
> patchset, this thread waits for queuing requests, but only for this
> work struct. If the target cpu has no thread for this work, then
> create one, like you do, etc...
>
> [...]
>
> There are good chances that we won't need any more per-driver/subsys
> workqueue creation after that, because everything would be per worklet.
> We could use a single schedule_work() for all of them and not bother
> choosing a specific workqueue or the central events/%d.
>
> Hmm?

I pretty much agree with you, my initial plan for a thread pool would be
very similar. I'll gradually work towards that goal.

--
Jens Axboe
* Re: [PATCH 0/6] Lazy workqueues
2009-08-20 12:41 ` Jens Axboe
@ 2009-08-20 13:04 ` Tejun Heo
0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2009-08-20 13:04 UTC (permalink / raw)
To: Jens Axboe
Cc: Frederic Weisbecker, linux-kernel, jeff, benh, bzolnier, alan,
Andrew Morton, Oleg Nesterov

Jens Axboe wrote:
> On Thu, Aug 20 2009, Frederic Weisbecker wrote:
>> An idea to solve this:
>>
>> We could have one thread per struct work_struct. Similarly to this
>> patchset, this thread waits for queuing requests, but only for this
>> work struct. If the target cpu has no thread for this work, then
>> create one, like you do, etc...
>>
>> Then the idea is to have one workqueue per struct work_struct, which
>> handles per-cpu task creation, etc... And this workqueue only handles
>> the given work.
>>
>> That may solve the deadlock scenarios that are often reported and that
>> lead to dedicated workqueue creation.
>>
>> That also makes the serialization of work execution between different
>> worklets disappear. We just keep the serialization within the same
>> work, which seems a pretty natural thing and is less haphazard than
>> multiple works of different natures being randomly serialized against
>> each other.
>>
>> Note the effect would not only be a reduction in deadlocks but also
>> probably an increase in throughput, because works of different
>> natures won't need to wait for the previous one's completion anymore.
>>
>> Also a reduction in latency (a high prio work that waits for a lower
>> prio work).
>>
>> There is a good chance that we won't need per-driver/subsys workqueue
>> creation anymore after that, because everything would be per
>> worklet. We could use a single schedule_work() for all of them and
>> not bother choosing a specific workqueue or the central events/%d.
>>
>> Hmm?
>
> I pretty much agree with you, my initial plan for a thread pool would be
> very similar. I'll gradually work towards that goal.

Several issues that come to my mind with the above approach are...

* There will still be cases where you need a fixed dedicated thread.
Execution resources for anything which might be used during IO need to
be preallocated (at least some of them) to guarantee forward progress.

* Depending on how heavily works are used (and I think their usage will
grow with improvements like this), we might end up with many idling
threads again, and please note that thread creation / destruction is
quite costly compared to what works usually do.

* Having different threads executing different works all the time might
improve latency, but if works are used frequently enough it's likely to
lower throughput, because short works which can be handled in batch by
a single thread now need to be handled by different threads. Scheduling
overhead can be significant compared to what those works actually do,
and it will also cost much more in cache footprint.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH 0/6] Lazy workqueues
2009-08-20 12:22 ` Frederic Weisbecker
2009-08-20 12:41 ` Jens Axboe
@ 2009-08-20 12:59 ` Steven Whitehouse
1 sibling, 0 replies; 30+ messages in thread
From: Steven Whitehouse @ 2009-08-20 12:59 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Jens Axboe, linux-kernel, jeff, benh, htejun, bzolnier, alan,
Andrew Morton, Oleg Nesterov

Hi,

On Thu, 2009-08-20 at 14:22 +0200, Frederic Weisbecker wrote:
> On Thu, Aug 20, 2009 at 12:19:58PM +0200, Jens Axboe wrote:
> > (sorry for the resend, but apparently the directory had some patches
> > in it already. plus, stupid git send-email doesn't default to
> > no chain replies, really annoying)
> >
> > Hi,
> >
> > After yesterdays rant on having too many kernel threads and checking
> > how many I actually have running on this system (531!), I decided to
> > try and do something about it.
> >
> > My goal was to retain the workqueue interface instead of coming up with
> > a new scheme that required conversion (or converting to slow_work which,
> > btw, is an awful name :-). I also wanted to retain the affinity
> > guarantees of workqueues as much as possible.
> >
> > So this is a first step in that direction, it's probably full of races
> > and holes, but should get the idea across. It adds a
> > create_lazy_workqueue() helper, similar to the other variants that we
> > currently have. A lazy workqueue works like a normal workqueue, except
> > that it only (by default) starts a core thread instead of threads for
> > all online CPUs. When work is queued on a lazy workqueue for a CPU
> > that doesn't have a thread running, it will be placed on the core CPUs
> > list and that will then create and move the work to the right target.
> > Should task creation fail, the queued work will be executed on the
> > core CPU instead. Once a lazy workqueue thread has been idle for a
> > certain amount of time, it will again exit.
> >
> > The patch boots here and I exercised the rpciod workqueue and
> > verified that it gets created, runs on the right CPU, and exits a while
> > later. So core functionality should be there, even if it has holes.
> >
> > With this patchset, I am now down to 280 kernel threads on one of my test
> > boxes. Still too many, but it's a start and a net reduction of 251
> > threads here, or 47%!
> >
> > The code can also be pulled from:
> >
> > git://git.kernel.dk/linux-2.6-block.git workqueue
> >
> > --
> > Jens Axboe
>
> That looks like a nice idea that may indeed solve the problem of
> thread proliferation with per-CPU workqueues.
>
> Now I think there is another problem that has tainted workqueues from
> the beginning, which is the deadlocks induced by one work waiting for
> another one in the same workqueue. And since workqueues execute their
> jobs serially, the effect is deadlock.
>
In GFS2 we've also got an additional issue. We cannot create threads at
the point of use (or let pending work block on thread creation) because
it implies a GFP_KERNEL memory allocation which could call back into
the fs. This is a particular issue with journal recovery (which uses
slow_work now, older versions used a private thread) and the code which
deals with inodes which have been unlinked remotely.

In addition to that, the glock workqueue which we are using would be
much better turned into a tasklet, or similar. The reason why we cannot
do this is that submission of block I/O is only possible from process
context.
At some stage it might be possible to partially solve the problem by
separating the parts of the state machine which submit I/O from those
which don't, but I'm not convinced that the effort is worth it.

There is also the issue of ordering of I/O requests. The glocks are
(for those which submit I/O) in a 1:1 relationship with inodes or
resource groups and thus indexed by disk block number. In the past I
have considered creating a workqueue with an elevator-based work
submission interface. This would greatly improve the I/O patterns
created by multiple submissions of glock work items. In particular, it
would make a big difference when the glock shrinker marks dirty glocks
for removal from the glock cache (under memory pressure) or when
processing large numbers of remote callbacks.

I've not yet come to any conclusion as to whether the "elevator
workqueue" is a good idea or not; any suggestions of a better solution
are very welcome,

Steve.

^ permalink raw reply	[flat|nested] 30+ messages in thread
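As a rough illustration of what Steve's "elevator workqueue" submission
interface might look like: work items carry a disk block key and are kept
sorted at queueing time, so the worker dequeues them in roughly ascending
disk order. This is purely a sketch; none of these names exist in the
kernel or in any posted patch:

	#include <linux/list.h>
	#include <linux/spinlock.h>
	#include <linux/types.h>

	/* Hypothetical elevator-ordered work item. */
	struct elv_work {
		struct list_head list;
		sector_t block;		/* disk block this work will touch */
		void (*fn)(struct elv_work *);
	};

	static LIST_HEAD(elv_pending);
	static DEFINE_SPINLOCK(elv_lock);

	/* Sorted insert: keep elv_pending ordered by ascending block number. */
	static void elv_queue_work(struct elv_work *new)
	{
		struct elv_work *pos;
		unsigned long flags;

		spin_lock_irqsave(&elv_lock, flags);
		list_for_each_entry(pos, &elv_pending, list) {
			if (pos->block > new->block)
				break;
		}
		/*
		 * Insert before the first larger key; if the loop ran off
		 * the end, &pos->list aliases &elv_pending and this appends
		 * at the tail.
		 */
		list_add_tail(&new->list, &pos->list);
		spin_unlock_irqrestore(&elv_lock, flags);
		/* ...then wake the worker thread (not shown). */
	}

A real implementation would also need to bound how long a low-numbered
item can starve newly queued higher-numbered ones, which is the same
fairness problem the block-layer elevators already deal with.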
* Re: [PATCH 0/6] Lazy workqueues 2009-08-20 10:19 [PATCH 0/6] Lazy workqueues Jens Axboe ` (7 preceding siblings ...) 2009-08-20 12:22 ` Frederic Weisbecker @ 2009-08-20 12:55 ` Tejun Heo 2009-08-21 6:58 ` Jens Axboe 8 siblings, 1 reply; 30+ messages in thread From: Tejun Heo @ 2009-08-20 12:55 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, jeff, benh, bzolnier, alan Hello, Jens. Jens Axboe wrote: > After yesterdays rant on having too many kernel threads and checking > how many I actually have running on this system (531!), I decided to > try and do something about it. Heh... that's a lot. How many cpus do you have there? Care to share the output of "ps -ef"? > My goal was to retain the workqueue interface instead of coming up with > a new scheme that required conversion (or converting to slow_work which, > btw, is an awful name :-). I also wanted to retain the affinity > guarantees of workqueues as much as possible. > > So this is a first step in that direction, it's probably full of races > and holes, but should get the idea across. It adds a > create_lazy_workqueue() helper, similar to the other variants that we > currently have. A lazy workqueue works like a normal workqueue, except > that it only (by default) starts a core thread instead of threads for > all online CPUs. When work is queued on a lazy workqueue for a CPU > that doesn't have a thread running, it will be placed on the core CPUs > list and that will then create and move the work to the right target. > Should task creation fail, the queued work will be executed on the > core CPU instead. Once a lazy workqueue thread has been idle for a > certain amount of time, it will again exit. Yeap, the approach seems simple and nice and resolves the problem of too many idle workers. > The patch boots here and I exercised the rpciod workqueue and > verified that it gets created, runs on the right CPU, and exits a while > later. So core functionality should be there, even if it has holes. > > With this patchset, I am now down to 280 kernel threads on one of my test > boxes. Still too many, but it's a start and a net reduction of 251 > threads here, or 47%! I'm trying to find out whether the perfect concurrency idea I talked about on the other thread can be implemented in reasonable manner. Would you mind holding for a few days before investing too much effort into expanding this one to handle multiple workers? Thanks. -- tejun ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 0/6] Lazy workqueues 2009-08-20 12:55 ` Tejun Heo @ 2009-08-21 6:58 ` Jens Axboe 0 siblings, 0 replies; 30+ messages in thread From: Jens Axboe @ 2009-08-21 6:58 UTC (permalink / raw) To: Tejun Heo; +Cc: linux-kernel, jeff, benh, bzolnier, alan [-- Attachment #1: Type: text/plain, Size: 2757 bytes --] On Thu, Aug 20 2009, Tejun Heo wrote: > Hello, Jens. > > Jens Axboe wrote: > > After yesterdays rant on having too many kernel threads and checking > > how many I actually have running on this system (531!), I decided to > > try and do something about it. > > Heh... that's a lot. How many cpus do you have there? Care to share > the output of "ps -ef"? That system has 64 cpus. ps -ef attached. > > My goal was to retain the workqueue interface instead of coming up with > > a new scheme that required conversion (or converting to slow_work which, > > btw, is an awful name :-). I also wanted to retain the affinity > > guarantees of workqueues as much as possible. > > > > So this is a first step in that direction, it's probably full of races > > and holes, but should get the idea across. It adds a > > create_lazy_workqueue() helper, similar to the other variants that we > > currently have. A lazy workqueue works like a normal workqueue, except > > that it only (by default) starts a core thread instead of threads for > > all online CPUs. When work is queued on a lazy workqueue for a CPU > > that doesn't have a thread running, it will be placed on the core CPUs > > list and that will then create and move the work to the right target. > > Should task creation fail, the queued work will be executed on the > > core CPU instead. Once a lazy workqueue thread has been idle for a > > certain amount of time, it will again exit. > > Yeap, the approach seems simple and nice and resolves the problem of > too many idle workers. I think so too :-) > > The patch boots here and I exercised the rpciod workqueue and > > verified that it gets created, runs on the right CPU, and exits a while > > later. So core functionality should be there, even if it has holes. > > > > With this patchset, I am now down to 280 kernel threads on one of my test > > boxes. Still too many, but it's a start and a net reduction of 251 > > threads here, or 47%! > > I'm trying to find out whether the perfect concurrency idea I talked > about on the other thread can be implemented in reasonable manner. > Would you mind holding for a few days before investing too much effort > into expanding this one to handle multiple workers? No problem, I'll just get the races closed up in the existing version. I think we basically have two classes of users here - one that the existing workqueue scheme works well for, high performance work execution where CPU affinity matters. The other is just slow work execution (like the libata pio task stuff), which would be better handled by a generic thread pool implementation. I think we should start converting those users to slow_work, in fact I think I'll try libata to try and set a good example :-) -- Jens Axboe [-- Attachment #2: ps-ef.txt --] [-- Type: text/plain, Size: 25896 bytes --] UID PID PPID C STIME TTY TIME CMD root 1 0 3 09:53 ? 00:00:06 init [2] root 2 0 0 09:53 ? 00:00:00 [kthreadd] root 3 2 0 09:53 ? 00:00:00 [migration/0] root 4 2 0 09:53 ? 00:00:00 [ksoftirqd/0] root 5 2 0 09:53 ? 00:00:00 [watchdog/0] root 6 2 0 09:53 ? 00:00:00 [migration/1] root 7 2 0 09:53 ? 00:00:00 [ksoftirqd/1] root 8 2 0 09:53 ? 00:00:00 [watchdog/1] root 9 2 0 09:53 ? 00:00:00 [migration/2] root 10 2 0 09:53 ? 
00:00:00 [ksoftirqd/2] root 11 2 0 09:53 ? 00:00:00 [watchdog/2] root 12 2 0 09:53 ? 00:00:00 [migration/3] root 13 2 0 09:53 ? 00:00:00 [ksoftirqd/3] root 14 2 0 09:53 ? 00:00:00 [watchdog/3] root 15 2 0 09:53 ? 00:00:00 [migration/4] root 16 2 0 09:53 ? 00:00:00 [ksoftirqd/4] root 17 2 0 09:53 ? 00:00:00 [watchdog/4] root 18 2 0 09:53 ? 00:00:00 [migration/5] root 19 2 0 09:53 ? 00:00:00 [ksoftirqd/5] root 20 2 0 09:53 ? 00:00:00 [watchdog/5] root 21 2 0 09:53 ? 00:00:00 [migration/6] root 22 2 0 09:53 ? 00:00:00 [ksoftirqd/6] root 23 2 0 09:53 ? 00:00:00 [watchdog/6] root 24 2 0 09:53 ? 00:00:00 [migration/7] root 25 2 0 09:53 ? 00:00:00 [ksoftirqd/7] root 26 2 0 09:53 ? 00:00:00 [watchdog/7] root 27 2 0 09:53 ? 00:00:00 [migration/8] root 28 2 0 09:53 ? 00:00:00 [ksoftirqd/8] root 29 2 0 09:53 ? 00:00:00 [watchdog/8] root 30 2 0 09:53 ? 00:00:00 [migration/9] root 31 2 0 09:53 ? 00:00:00 [ksoftirqd/9] root 32 2 0 09:53 ? 00:00:00 [watchdog/9] root 33 2 0 09:53 ? 00:00:00 [migration/10] root 34 2 0 09:53 ? 00:00:00 [ksoftirqd/10] root 35 2 0 09:53 ? 00:00:00 [watchdog/10] root 36 2 0 09:53 ? 00:00:00 [migration/11] root 37 2 0 09:53 ? 00:00:00 [ksoftirqd/11] root 38 2 0 09:53 ? 00:00:00 [watchdog/11] root 39 2 0 09:53 ? 00:00:00 [migration/12] root 40 2 0 09:53 ? 00:00:00 [ksoftirqd/12] root 41 2 0 09:53 ? 00:00:00 [watchdog/12] root 42 2 0 09:53 ? 00:00:00 [migration/13] root 43 2 0 09:53 ? 00:00:00 [ksoftirqd/13] root 44 2 0 09:53 ? 00:00:00 [watchdog/13] root 45 2 0 09:53 ? 00:00:00 [migration/14] root 46 2 0 09:53 ? 00:00:00 [ksoftirqd/14] root 47 2 0 09:53 ? 00:00:00 [watchdog/14] root 48 2 0 09:53 ? 00:00:00 [migration/15] root 49 2 0 09:53 ? 00:00:00 [ksoftirqd/15] root 50 2 0 09:53 ? 00:00:00 [watchdog/15] root 51 2 0 09:53 ? 00:00:00 [migration/16] root 52 2 0 09:53 ? 00:00:00 [ksoftirqd/16] root 53 2 0 09:53 ? 00:00:00 [watchdog/16] root 54 2 0 09:53 ? 00:00:00 [migration/17] root 55 2 0 09:53 ? 00:00:00 [ksoftirqd/17] root 56 2 0 09:53 ? 00:00:00 [watchdog/17] root 57 2 0 09:53 ? 00:00:00 [migration/18] root 58 2 0 09:53 ? 00:00:00 [ksoftirqd/18] root 59 2 0 09:53 ? 00:00:00 [watchdog/18] root 60 2 0 09:53 ? 00:00:00 [migration/19] root 61 2 0 09:53 ? 00:00:00 [ksoftirqd/19] root 62 2 0 09:53 ? 00:00:00 [watchdog/19] root 63 2 0 09:53 ? 00:00:00 [migration/20] root 64 2 0 09:53 ? 00:00:00 [ksoftirqd/20] root 65 2 0 09:53 ? 00:00:00 [watchdog/20] root 66 2 0 09:53 ? 00:00:00 [migration/21] root 67 2 0 09:53 ? 00:00:00 [ksoftirqd/21] root 68 2 0 09:53 ? 00:00:00 [watchdog/21] root 69 2 0 09:53 ? 00:00:00 [migration/22] root 70 2 0 09:53 ? 00:00:00 [ksoftirqd/22] root 71 2 0 09:53 ? 00:00:00 [watchdog/22] root 72 2 0 09:53 ? 00:00:00 [migration/23] root 73 2 0 09:53 ? 00:00:00 [ksoftirqd/23] root 74 2 0 09:53 ? 00:00:00 [watchdog/23] root 75 2 0 09:53 ? 00:00:00 [migration/24] root 76 2 0 09:53 ? 00:00:00 [ksoftirqd/24] root 77 2 0 09:53 ? 00:00:00 [watchdog/24] root 78 2 0 09:53 ? 00:00:00 [migration/25] root 79 2 0 09:53 ? 00:00:00 [ksoftirqd/25] root 80 2 0 09:53 ? 00:00:00 [watchdog/25] root 81 2 0 09:53 ? 00:00:00 [migration/26] root 82 2 0 09:53 ? 00:00:00 [ksoftirqd/26] root 83 2 0 09:53 ? 00:00:00 [watchdog/26] root 84 2 0 09:53 ? 00:00:00 [migration/27] root 85 2 0 09:53 ? 00:00:00 [ksoftirqd/27] root 86 2 0 09:53 ? 00:00:00 [watchdog/27] root 87 2 0 09:53 ? 00:00:00 [migration/28] root 88 2 0 09:53 ? 00:00:00 [ksoftirqd/28] root 89 2 0 09:53 ? 00:00:00 [watchdog/28] root 90 2 0 09:53 ? 00:00:00 [migration/29] root 91 2 0 09:53 ? 
00:00:00 [ksoftirqd/29] root 92 2 0 09:53 ? 00:00:00 [watchdog/29] root 93 2 0 09:53 ? 00:00:00 [migration/30] root 94 2 0 09:53 ? 00:00:00 [ksoftirqd/30] root 95 2 0 09:53 ? 00:00:00 [watchdog/30] root 96 2 0 09:53 ? 00:00:00 [migration/31] root 97 2 0 09:53 ? 00:00:00 [ksoftirqd/31] root 98 2 0 09:53 ? 00:00:00 [watchdog/31] root 99 2 0 09:53 ? 00:00:00 [migration/32] root 100 2 0 09:53 ? 00:00:00 [ksoftirqd/32] root 101 2 0 09:53 ? 00:00:00 [watchdog/32] root 102 2 0 09:53 ? 00:00:00 [migration/33] root 103 2 0 09:53 ? 00:00:00 [ksoftirqd/33] root 104 2 0 09:53 ? 00:00:00 [watchdog/33] root 105 2 0 09:53 ? 00:00:00 [migration/34] root 106 2 0 09:53 ? 00:00:00 [ksoftirqd/34] root 107 2 0 09:53 ? 00:00:00 [watchdog/34] root 108 2 0 09:53 ? 00:00:00 [migration/35] root 109 2 0 09:53 ? 00:00:00 [ksoftirqd/35] root 110 2 0 09:53 ? 00:00:00 [watchdog/35] root 111 2 0 09:53 ? 00:00:00 [migration/36] root 112 2 0 09:53 ? 00:00:00 [ksoftirqd/36] root 113 2 0 09:53 ? 00:00:00 [watchdog/36] root 114 2 0 09:53 ? 00:00:00 [migration/37] root 115 2 0 09:53 ? 00:00:00 [ksoftirqd/37] root 116 2 0 09:53 ? 00:00:00 [watchdog/37] root 117 2 0 09:53 ? 00:00:00 [migration/38] root 118 2 0 09:53 ? 00:00:00 [ksoftirqd/38] root 119 2 0 09:53 ? 00:00:00 [watchdog/38] root 120 2 0 09:53 ? 00:00:00 [migration/39] root 121 2 0 09:53 ? 00:00:00 [ksoftirqd/39] root 122 2 0 09:53 ? 00:00:00 [watchdog/39] root 123 2 0 09:53 ? 00:00:00 [migration/40] root 124 2 0 09:53 ? 00:00:00 [ksoftirqd/40] root 125 2 0 09:53 ? 00:00:00 [watchdog/40] root 126 2 0 09:53 ? 00:00:00 [migration/41] root 127 2 0 09:53 ? 00:00:00 [ksoftirqd/41] root 128 2 0 09:53 ? 00:00:00 [watchdog/41] root 129 2 0 09:53 ? 00:00:00 [migration/42] root 130 2 0 09:53 ? 00:00:00 [ksoftirqd/42] root 131 2 0 09:53 ? 00:00:00 [watchdog/42] root 132 2 0 09:53 ? 00:00:00 [migration/43] root 133 2 0 09:53 ? 00:00:00 [ksoftirqd/43] root 134 2 0 09:53 ? 00:00:00 [watchdog/43] root 135 2 0 09:53 ? 00:00:00 [migration/44] root 136 2 0 09:53 ? 00:00:00 [ksoftirqd/44] root 137 2 0 09:53 ? 00:00:00 [watchdog/44] root 138 2 0 09:53 ? 00:00:00 [migration/45] root 139 2 0 09:53 ? 00:00:00 [ksoftirqd/45] root 140 2 0 09:53 ? 00:00:00 [watchdog/45] root 141 2 0 09:53 ? 00:00:00 [migration/46] root 142 2 0 09:53 ? 00:00:00 [ksoftirqd/46] root 143 2 0 09:53 ? 00:00:00 [watchdog/46] root 144 2 0 09:53 ? 00:00:00 [migration/47] root 145 2 0 09:53 ? 00:00:00 [ksoftirqd/47] root 146 2 0 09:53 ? 00:00:00 [watchdog/47] root 147 2 0 09:53 ? 00:00:00 [migration/48] root 148 2 0 09:53 ? 00:00:00 [ksoftirqd/48] root 149 2 0 09:53 ? 00:00:00 [watchdog/48] root 150 2 0 09:53 ? 00:00:00 [migration/49] root 151 2 0 09:53 ? 00:00:00 [ksoftirqd/49] root 152 2 0 09:53 ? 00:00:00 [watchdog/49] root 153 2 0 09:53 ? 00:00:00 [migration/50] root 154 2 0 09:53 ? 00:00:00 [ksoftirqd/50] root 155 2 0 09:53 ? 00:00:00 [watchdog/50] root 156 2 0 09:53 ? 00:00:00 [migration/51] root 157 2 0 09:53 ? 00:00:00 [ksoftirqd/51] root 158 2 0 09:53 ? 00:00:00 [watchdog/51] root 159 2 0 09:53 ? 00:00:00 [migration/52] root 160 2 0 09:53 ? 00:00:00 [ksoftirqd/52] root 161 2 0 09:53 ? 00:00:00 [watchdog/52] root 162 2 0 09:53 ? 00:00:00 [migration/53] root 163 2 0 09:53 ? 00:00:00 [ksoftirqd/53] root 164 2 0 09:53 ? 00:00:00 [watchdog/53] root 165 2 0 09:53 ? 00:00:00 [migration/54] root 166 2 0 09:53 ? 00:00:00 [ksoftirqd/54] root 167 2 0 09:53 ? 00:00:00 [watchdog/54] root 168 2 0 09:53 ? 00:00:00 [migration/55] root 169 2 0 09:53 ? 00:00:00 [ksoftirqd/55] root 170 2 0 09:53 ? 
00:00:00 [watchdog/55] root 171 2 0 09:53 ? 00:00:00 [migration/56] root 172 2 0 09:53 ? 00:00:00 [ksoftirqd/56] root 173 2 0 09:53 ? 00:00:00 [watchdog/56] root 174 2 0 09:53 ? 00:00:00 [migration/57] root 175 2 0 09:53 ? 00:00:00 [ksoftirqd/57] root 176 2 0 09:53 ? 00:00:00 [watchdog/57] root 177 2 0 09:53 ? 00:00:00 [migration/58] root 178 2 0 09:53 ? 00:00:00 [ksoftirqd/58] root 179 2 0 09:53 ? 00:00:00 [watchdog/58] root 180 2 0 09:53 ? 00:00:00 [migration/59] root 181 2 0 09:53 ? 00:00:00 [ksoftirqd/59] root 182 2 0 09:53 ? 00:00:00 [watchdog/59] root 183 2 0 09:53 ? 00:00:00 [migration/60] root 184 2 0 09:53 ? 00:00:00 [ksoftirqd/60] root 185 2 0 09:53 ? 00:00:00 [watchdog/60] root 186 2 0 09:53 ? 00:00:00 [migration/61] root 187 2 0 09:53 ? 00:00:00 [ksoftirqd/61] root 188 2 0 09:53 ? 00:00:00 [watchdog/61] root 189 2 0 09:53 ? 00:00:00 [migration/62] root 190 2 0 09:53 ? 00:00:00 [ksoftirqd/62] root 191 2 0 09:53 ? 00:00:00 [watchdog/62] root 192 2 0 09:53 ? 00:00:00 [migration/63] root 193 2 0 09:53 ? 00:00:00 [ksoftirqd/63] root 194 2 0 09:53 ? 00:00:00 [watchdog/63] root 195 2 0 09:53 ? 00:00:00 [events/0] root 196 2 0 09:53 ? 00:00:00 [events/1] root 197 2 0 09:53 ? 00:00:00 [events/2] root 198 2 0 09:53 ? 00:00:00 [events/3] root 199 2 0 09:53 ? 00:00:00 [events/4] root 200 2 0 09:53 ? 00:00:00 [events/5] root 201 2 0 09:53 ? 00:00:00 [events/6] root 202 2 0 09:53 ? 00:00:00 [events/7] root 203 2 0 09:53 ? 00:00:00 [events/8] root 204 2 0 09:53 ? 00:00:00 [events/9] root 205 2 0 09:53 ? 00:00:00 [events/10] root 206 2 0 09:53 ? 00:00:00 [events/11] root 207 2 0 09:53 ? 00:00:00 [events/12] root 208 2 0 09:53 ? 00:00:00 [events/13] root 209 2 0 09:53 ? 00:00:00 [events/14] root 210 2 0 09:53 ? 00:00:00 [events/15] root 211 2 0 09:53 ? 00:00:00 [events/16] root 212 2 0 09:53 ? 00:00:00 [events/17] root 213 2 0 09:53 ? 00:00:00 [events/18] root 214 2 0 09:53 ? 00:00:00 [events/19] root 215 2 0 09:53 ? 00:00:00 [events/20] root 216 2 0 09:53 ? 00:00:00 [events/21] root 217 2 0 09:53 ? 00:00:00 [events/22] root 218 2 0 09:53 ? 00:00:00 [events/23] root 219 2 0 09:53 ? 00:00:00 [events/24] root 220 2 0 09:53 ? 00:00:00 [events/25] root 221 2 0 09:53 ? 00:00:00 [events/26] root 222 2 0 09:53 ? 00:00:00 [events/27] root 223 2 0 09:53 ? 00:00:00 [events/28] root 224 2 0 09:53 ? 00:00:00 [events/29] root 225 2 0 09:53 ? 00:00:00 [events/30] root 226 2 0 09:53 ? 00:00:00 [events/31] root 227 2 0 09:53 ? 00:00:00 [events/32] root 228 2 0 09:53 ? 00:00:00 [events/33] root 229 2 0 09:53 ? 00:00:00 [events/34] root 230 2 0 09:53 ? 00:00:00 [events/35] root 231 2 0 09:53 ? 00:00:00 [events/36] root 232 2 0 09:53 ? 00:00:00 [events/37] root 233 2 0 09:53 ? 00:00:00 [events/38] root 234 2 0 09:53 ? 00:00:00 [events/39] root 235 2 0 09:53 ? 00:00:00 [events/40] root 236 2 0 09:53 ? 00:00:00 [events/41] root 237 2 0 09:53 ? 00:00:00 [events/42] root 238 2 0 09:53 ? 00:00:00 [events/43] root 239 2 0 09:53 ? 00:00:00 [events/44] root 240 2 0 09:53 ? 00:00:00 [events/45] root 241 2 0 09:53 ? 00:00:00 [events/46] root 242 2 0 09:53 ? 00:00:00 [events/47] root 243 2 0 09:53 ? 00:00:00 [events/48] root 244 2 0 09:53 ? 00:00:00 [events/49] root 245 2 0 09:53 ? 00:00:00 [events/50] root 246 2 0 09:53 ? 00:00:00 [events/51] root 247 2 0 09:53 ? 00:00:00 [events/52] root 248 2 0 09:53 ? 00:00:00 [events/53] root 249 2 0 09:53 ? 00:00:00 [events/54] root 250 2 0 09:53 ? 00:00:00 [events/55] root 251 2 0 09:53 ? 00:00:00 [events/56] root 252 2 0 09:53 ? 00:00:00 [events/57] root 253 2 0 09:53 ? 
00:00:00 [events/58] root 254 2 0 09:53 ? 00:00:00 [events/59] root 255 2 0 09:53 ? 00:00:00 [events/60] root 256 2 0 09:53 ? 00:00:00 [events/61] root 257 2 0 09:53 ? 00:00:00 [events/62] root 258 2 0 09:53 ? 00:00:00 [events/63] root 259 2 0 09:53 ? 00:00:00 [khelper] root 264 2 0 09:53 ? 00:00:00 [async/mgr] root 432 2 0 09:53 ? 00:00:00 [sync_supers] root 434 2 0 09:53 ? 00:00:00 [bdi-default] root 435 2 0 09:53 ? 00:00:00 [kblockd/0] root 436 2 0 09:53 ? 00:00:00 [kblockd/1] root 437 2 0 09:53 ? 00:00:00 [kblockd/2] root 438 2 0 09:53 ? 00:00:00 [kblockd/3] root 439 2 0 09:53 ? 00:00:00 [kblockd/4] root 440 2 0 09:53 ? 00:00:00 [kblockd/5] root 441 2 0 09:53 ? 00:00:00 [kblockd/6] root 442 2 0 09:53 ? 00:00:00 [kblockd/7] root 443 2 0 09:53 ? 00:00:00 [kblockd/8] root 444 2 0 09:53 ? 00:00:00 [kblockd/9] root 445 2 0 09:53 ? 00:00:00 [kblockd/10] root 446 2 0 09:53 ? 00:00:00 [kblockd/11] root 447 2 0 09:53 ? 00:00:00 [kblockd/12] root 448 2 0 09:53 ? 00:00:00 [kblockd/13] root 449 2 0 09:53 ? 00:00:00 [kblockd/14] root 450 2 0 09:53 ? 00:00:00 [kblockd/15] root 451 2 0 09:53 ? 00:00:00 [kblockd/16] root 452 2 0 09:53 ? 00:00:00 [kblockd/17] root 453 2 0 09:53 ? 00:00:00 [kblockd/18] root 454 2 0 09:53 ? 00:00:00 [kblockd/19] root 455 2 0 09:53 ? 00:00:00 [kblockd/20] root 456 2 0 09:53 ? 00:00:00 [kblockd/21] root 457 2 0 09:53 ? 00:00:00 [kblockd/22] root 458 2 0 09:53 ? 00:00:00 [kblockd/23] root 459 2 0 09:53 ? 00:00:00 [kblockd/24] root 460 2 0 09:53 ? 00:00:00 [kblockd/25] root 461 2 0 09:53 ? 00:00:00 [kblockd/26] root 462 2 0 09:53 ? 00:00:00 [kblockd/27] root 463 2 0 09:53 ? 00:00:00 [kblockd/28] root 464 2 0 09:53 ? 00:00:00 [kblockd/29] root 465 2 0 09:53 ? 00:00:00 [kblockd/30] root 466 2 0 09:53 ? 00:00:00 [kblockd/31] root 467 2 0 09:53 ? 00:00:00 [kblockd/32] root 468 2 0 09:53 ? 00:00:00 [kblockd/33] root 469 2 0 09:53 ? 00:00:00 [kblockd/34] root 470 2 0 09:53 ? 00:00:00 [kblockd/35] root 471 2 0 09:53 ? 00:00:00 [kblockd/36] root 472 2 0 09:53 ? 00:00:00 [kblockd/37] root 473 2 0 09:53 ? 00:00:00 [kblockd/38] root 474 2 0 09:53 ? 00:00:00 [kblockd/39] root 475 2 0 09:53 ? 00:00:00 [kblockd/40] root 476 2 0 09:53 ? 00:00:00 [kblockd/41] root 477 2 0 09:53 ? 00:00:00 [kblockd/42] root 478 2 0 09:53 ? 00:00:00 [kblockd/43] root 479 2 0 09:53 ? 00:00:00 [kblockd/44] root 480 2 0 09:53 ? 00:00:00 [kblockd/45] root 481 2 0 09:53 ? 00:00:00 [kblockd/46] root 482 2 0 09:53 ? 00:00:00 [kblockd/47] root 483 2 0 09:53 ? 00:00:00 [kblockd/48] root 484 2 0 09:53 ? 00:00:00 [kblockd/49] root 485 2 0 09:53 ? 00:00:00 [kblockd/50] root 486 2 0 09:53 ? 00:00:00 [kblockd/51] root 487 2 0 09:53 ? 00:00:00 [kblockd/52] root 488 2 0 09:53 ? 00:00:00 [kblockd/53] root 489 2 0 09:53 ? 00:00:00 [kblockd/54] root 490 2 0 09:53 ? 00:00:00 [kblockd/55] root 491 2 0 09:53 ? 00:00:00 [kblockd/56] root 492 2 0 09:53 ? 00:00:00 [kblockd/57] root 493 2 0 09:53 ? 00:00:00 [kblockd/58] root 494 2 0 09:53 ? 00:00:00 [kblockd/59] root 495 2 0 09:53 ? 00:00:00 [kblockd/60] root 496 2 0 09:53 ? 00:00:00 [kblockd/61] root 497 2 0 09:53 ? 00:00:00 [kblockd/62] root 498 2 0 09:53 ? 00:00:00 [kblockd/63] root 500 2 0 09:53 ? 00:00:00 [kacpid] root 501 2 0 09:53 ? 00:00:00 [kacpi_notify] root 502 2 0 09:53 ? 00:00:00 [kacpi_hotplug] root 720 2 0 09:53 ? 00:00:00 [ata/0] root 721 2 0 09:53 ? 00:00:00 [ata_aux] root 723 2 0 09:53 ? 00:00:00 [kseriod] root 757 2 0 09:53 ? 00:00:00 [kondemand/0] root 1287 2 0 09:53 ? 00:00:00 [khungtaskd] root 1288 2 0 09:53 ? 00:00:00 [kswapd0] root 1335 2 0 09:53 ? 
00:00:00 [aio/0] root 1349 2 0 09:53 ? 00:00:00 [nfsiod] root 2154 2 0 09:53 ? 00:00:00 [scsi_eh_0] root 2181 2 0 09:53 ? 00:00:00 [scsi_eh_1] root 2184 2 0 09:53 ? 00:00:00 [scsi_eh_2] root 2186 2 0 09:53 ? 00:00:00 [scsi_eh_3] root 2188 2 0 09:53 ? 00:00:00 [scsi_eh_4] root 2190 2 0 09:53 ? 00:00:00 [scsi_eh_5] root 2192 2 0 09:53 ? 00:00:00 [scsi_eh_6] root 2223 2 0 09:53 ? 00:00:00 [kpsmoused] root 2227 2 0 09:53 ? 00:00:00 [rpciod/0] root 2245 2 0 09:53 ? 00:00:00 [kondemand/1] root 2246 2 0 09:53 ? 00:00:00 [kondemand/2] root 2247 2 0 09:53 ? 00:00:00 [kondemand/4] root 2278 2 0 09:53 ? 00:00:00 [kondemand/36] root 2279 2 0 09:53 ? 00:00:00 [kondemand/39] root 2301 2 0 09:53 ? 00:00:00 [kondemand/63] root 2304 2 0 09:53 ? 00:00:00 [kondemand/43] root 2308 2 0 09:53 ? 00:00:00 [kondemand/37] root 2309 2 0 09:53 ? 00:00:00 [kondemand/32] root 2310 2 0 09:53 ? 00:00:00 [kondemand/8] root 2313 2 0 09:53 ? 00:00:00 [kjournald] root 2314 2 0 09:53 ? 00:00:00 [kondemand/44] root 2317 2 0 09:53 ? 00:00:00 [kondemand/41] root 2322 2 0 09:53 ? 00:00:00 [kondemand/52] root 2327 2 0 09:53 ? 00:00:00 [kondemand/60] root 2329 2 0 09:53 ? 00:00:00 [kondemand/48] root 2331 2 0 09:53 ? 00:00:00 [kondemand/56] root 2352 2 0 09:53 ? 00:00:00 [kondemand/28] root 2369 2 0 09:53 ? 00:00:00 [kondemand/16] root 2383 1 1 09:53 ? 00:00:03 udevd --daemon root 2392 2 0 09:53 ? 00:00:00 [kondemand/40] root 2395 2 0 09:53 ? 00:00:00 [kondemand/45] root 2398 2 0 09:53 ? 00:00:00 [kondemand/11] root 2401 2 0 09:53 ? 00:00:00 [kondemand/33] root 2427 2 0 09:53 ? 00:00:00 [kondemand/49] root 2437 2 0 09:53 ? 00:00:00 [kondemand/47] root 2442 2 0 09:53 ? 00:00:00 [kondemand/13] root 2447 2 0 09:53 ? 00:00:00 [kondemand/51] root 2451 2 0 09:53 ? 00:00:00 [kondemand/55] root 2452 2 0 09:53 ? 00:00:00 [kondemand/59] root 2474 2 0 09:53 ? 00:00:00 [kondemand/53] root 2480 2 0 09:53 ? 00:00:00 [kondemand/57] root 2515 2 0 09:53 ? 00:00:00 [kondemand/7] root 2564 2 0 09:53 ? 00:00:00 [kondemand/61] root 2577 2 0 09:53 ? 00:00:00 [kondemand/35] root 3648 2 0 09:53 ? 00:00:00 [ksuspend_usbd] root 3655 2 0 09:53 ? 00:00:00 [khubd] root 3710 2 0 09:53 ? 00:00:00 [mpt_poll_0] root 3711 2 0 09:53 ? 00:00:00 [mpt/0] root 3873 2 0 09:53 ? 00:00:00 [kondemand/20] root 3901 2 0 09:53 ? 00:00:00 [usbhid_resumer] root 3931 2 0 09:53 ? 00:00:00 [kondemand/29] root 3932 2 0 09:53 ? 00:00:00 [kondemand/5] root 3987 2 0 09:53 ? 00:00:00 [kondemand/9] root 4094 2 0 09:53 ? 00:00:00 [scsi_eh_7] root 4109 2 0 09:53 ? 00:00:00 [kondemand/12] root 4130 2 0 09:53 ? 00:00:00 [kondemand/17] root 4132 2 0 09:53 ? 00:00:00 [kondemand/21] root 4199 2 0 09:53 ? 00:00:00 [kondemand/25] root 4459 2 0 09:54 ? 00:00:00 [kjournald] root 4525 2 0 09:54 ? 00:00:00 [flush-8:0] root 4543 1 0 09:54 ? 00:00:00 dhclient3 -pf /var/run/dhclient.eth0.pid -lf /var/lib/dhcp3/dhclient.eth0.leases eth0 daemon 4560 1 0 09:54 ? 00:00:00 /sbin/portmap statd 4571 1 0 09:54 ? 00:00:00 /sbin/rpc.statd root 4732 1 0 09:54 ? 00:00:00 /usr/sbin/rsyslogd -c3 root 4743 1 0 09:54 ? 00:00:00 /usr/sbin/acpid root 4756 1 0 09:54 ? 00:00:00 /usr/sbin/sshd 101 5061 1 0 09:54 ? 00:00:00 /usr/sbin/exim4 -bd -q30m daemon 5088 1 0 09:54 ? 00:00:00 /usr/sbin/atd root 5108 1 0 09:54 ? 
00:00:00 /usr/sbin/cron root 5125 1 0 09:54 tty1 00:00:00 /sbin/getty 38400 tty1 root 5126 1 0 09:54 tty2 00:00:00 /sbin/getty 38400 tty2 root 5127 1 0 09:54 tty3 00:00:00 /sbin/getty 38400 tty3 root 5128 1 0 09:54 tty4 00:00:00 /sbin/getty 38400 tty4 root 5129 1 0 09:54 tty5 00:00:00 /sbin/getty 38400 tty5 root 5130 1 0 09:54 tty6 00:00:00 /sbin/getty 38400 tty6 root 5159 2 0 09:55 ? 00:00:00 [kondemand/38] root 5182 4756 1 09:56 ? 00:00:00 sshd: axboe [priv] axboe 5186 5182 0 09:56 ? 00:00:00 sshd: axboe@pts/0 axboe 5187 5186 0 09:56 pts/0 00:00:00 -bash axboe 5190 5187 0 09:56 pts/0 00:00:00 ps -ef ^ permalink raw reply [flat|nested] 30+ messages in thread
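The libata-to-slow_work conversion Jens mentions above would follow the
standard pattern of that API. A rough sketch against the 2.6.31-era
slow_work interface (struct my_task and the my_* names are illustrative,
not an actual patch):

	#include <linux/slow-work.h>

	struct my_task {
		struct slow_work work;
		/* ... task state ... */
	};

	static int my_task_get_ref(struct slow_work *work)
	{
		/* Pin whatever object 'work' is embedded in. */
		return 0;
	}

	static void my_task_put_ref(struct slow_work *work)
	{
		/* Drop the reference taken in get_ref(). */
	}

	static void my_task_execute(struct slow_work *work)
	{
		struct my_task *task = container_of(work, struct my_task, work);

		/* Runs in one of the shared slow_work threads. */
		(void)task;
	}

	static const struct slow_work_ops my_task_ops = {
		.get_ref	= my_task_get_ref,
		.put_ref	= my_task_put_ref,
		.execute	= my_task_execute,
	};

	static int my_subsys_init(void)
	{
		int ret = slow_work_register_user();
		if (ret)
			return ret;
		/* Then, per task:
		 *	slow_work_init(&task->work, &my_task_ops);
		 *	slow_work_enqueue(&task->work);
		 */
		return 0;
	}

The point of the conversion is that such users share the slow_work thread
pool instead of each keeping per-CPU threads of their own around.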
* [PATCH 0/6] Lazy workqueues @ 2009-08-20 10:17 Jens Axboe 2009-08-20 10:17 ` [PATCH 1/4] direct-io: unify argument passing by adding a dio_args structure Jens Axboe 0 siblings, 1 reply; 30+ messages in thread From: Jens Axboe @ 2009-08-20 10:17 UTC (permalink / raw) To: linux-kernel; +Cc: jeff, benh, htejun, bzolnier, alan Hi, After yesterdays rant on having too many kernel threads and checking how many I actually have running on this system (531!), I decided to try and do something about it. My goal was to retain the workqueue interface instead of coming up with a new scheme that required conversion (or converting to slow_work which, btw, is an awful name :-). I also wanted to retain the affinity guarantees of workqueues as much as possible. So this is a first step in that direction, it's probably full of races and holes, but should get the idea across. It adds a create_lazy_workqueue() helper, similar to the other variants that we currently have. A lazy workqueue works like a normal workqueue, except that it only (by default) starts a core thread instead of threads for all online CPUs. When work is queued on a lazy workqueue for a CPU that doesn't have a thread running, it will be placed on the core CPUs list and that will then create and move the work to the right target. Should task creation fail, the queued work will be executed on the core CPU instead. Once a lazy workqueue thread has been idle for a certain amount of time, it will again exit. The patch boots here and I exercised the rpciod workqueue and verified that it gets created, runs on the right CPU, and exits a while later. So core functionality should be there, even if it has holes. With this patchset, I am now down to 280 kernel threads on one of my test boxes. Still too many, but it's a start and a net reduction of 251 threads here, or 47%! The code can also be pulled from: git://git.kernel.dk/linux-2.6-block.git workqueue -- Jens Axboe ^ permalink raw reply [flat|nested] 30+ messages in thread
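The cover letter above describes the core lazy behaviour: queue to an
existing per-CPU thread if there is one, otherwise bounce the work via the
always-present core CPU, which creates a worker (or runs the work itself
if thread creation fails). As a sketch of that decision, with the helper
names (wq_has_thread, insert_work_on, wq->core_cpu) made up for
illustration rather than taken from the actual patch:

	/* Sketch of the lazy queueing decision; names are hypothetical. */
	static void lazy_queue_work(struct workqueue_struct *wq,
				    struct work_struct *work, int cpu)
	{
		if (wq_has_thread(wq, cpu)) {
			/* Fast path: target CPU already has a worker. */
			insert_work_on(wq, work, cpu);
			return;
		}

		/*
		 * Slow path: hand the work to the core CPU.  Its thread
		 * creates a worker for 'cpu' and migrates the work there;
		 * if thread creation fails, it executes the work on the
		 * core CPU itself.  Idle workers exit again after a
		 * timeout, so the thread count tracks actual usage.
		 */
		insert_work_on(wq, work, wq->core_cpu);
	}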
* [PATCH 1/4] direct-io: unify argument passing by adding a dio_args structure 2009-08-20 10:17 Jens Axboe @ 2009-08-20 10:17 ` Jens Axboe 2009-08-20 10:17 ` [PATCH 1/6] workqueue: replace singlethread/freezable/rt parameters and variables with flags Jens Axboe 0 siblings, 1 reply; 30+ messages in thread From: Jens Axboe @ 2009-08-20 10:17 UTC (permalink / raw) To: linux-kernel; +Cc: jeff, benh, htejun, bzolnier, alan, Jens Axboe The O_DIRECT IO path is a mess of arguments. Clean that up by passing those arguments in a dedicated dio_args structure. This is in preparation for changing the internal implementation to be page based instead of using iovecs. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> --- fs/block_dev.c | 7 ++-- fs/btrfs/inode.c | 4 +-- fs/direct-io.c | 70 +++++++++++++++++++++++++++---------------- fs/ext2/inode.c | 8 ++--- fs/ext3/inode.c | 15 ++++----- fs/ext4/inode.c | 15 ++++----- fs/fat/inode.c | 12 +++---- fs/gfs2/aops.c | 11 ++---- fs/hfs/inode.c | 7 ++-- fs/hfsplus/inode.c | 8 ++-- fs/jfs/inode.c | 7 ++-- fs/nfs/direct.c | 9 ++---- fs/nilfs2/inode.c | 9 ++--- fs/ocfs2/aops.c | 11 ++----- fs/reiserfs/inode.c | 7 +--- fs/xfs/linux-2.6/xfs_aops.c | 19 ++++-------- include/linux/fs.h | 59 +++++++++++++++++++++--------------- include/linux/nfs_fs.h | 3 +- mm/filemap.c | 8 +++-- 19 files changed, 141 insertions(+), 148 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 94dfda2..2e494fa 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -166,14 +166,13 @@ blkdev_get_blocks(struct inode *inode, sector_t iblock, } static ssize_t -blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, - loff_t offset, unsigned long nr_segs) +blkdev_direct_IO(struct kiocb *iocb, struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; - return blockdev_direct_IO_no_locking(rw, iocb, inode, I_BDEV(inode), - iov, offset, nr_segs, blkdev_get_blocks, NULL); + return blockdev_direct_IO_no_locking(iocb, inode, I_BDEV(inode), + args, blkdev_get_blocks, NULL); } int __sync_blockdev(struct block_device *bdev, int wait) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 272b9b2..094e3a7 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4308,9 +4308,7 @@ out: return em; } -static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, - const struct iovec *iov, loff_t offset, - unsigned long nr_segs) +static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct dio_args *args) { return -EINVAL; } diff --git a/fs/direct-io.c b/fs/direct-io.c index 8b10b87..181848c 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -929,14 +929,14 @@ out: * Releases both i_mutex and i_alloc_sem */ static ssize_t -direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, - const struct iovec *iov, loff_t offset, unsigned long nr_segs, - unsigned blkbits, get_block_t get_block, dio_iodone_t end_io, - struct dio *dio) +direct_io_worker(struct kiocb *iocb, struct inode *inode, + struct dio_args *args, unsigned blkbits, get_block_t get_block, + dio_iodone_t end_io, struct dio *dio) { - unsigned long user_addr; + const struct iovec *iov = args->iov; + unsigned long user_addr; unsigned long flags; - int seg; + int seg, rw = args->rw; ssize_t ret = 0; ssize_t ret2; size_t bytes; @@ -945,7 +945,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, dio->rw = rw; dio->blkbits = blkbits; dio->blkfactor = inode->i_blkbits - blkbits; - dio->block_in_file = offset >> blkbits; + dio->block_in_file = args->offset >> blkbits; 
dio->get_block = get_block; dio->end_io = end_io; @@ -965,14 +965,14 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, if (unlikely(dio->blkfactor)) dio->pages_in_io = 2; - for (seg = 0; seg < nr_segs; seg++) { - user_addr = (unsigned long)iov[seg].iov_base; + for (seg = 0; seg < args->nr_segs; seg++) { + user_addr = (unsigned long) iov[seg].iov_base; dio->pages_in_io += ((user_addr+iov[seg].iov_len +PAGE_SIZE-1)/PAGE_SIZE - user_addr/PAGE_SIZE); } - for (seg = 0; seg < nr_segs; seg++) { + for (seg = 0; seg < args->nr_segs; seg++) { user_addr = (unsigned long)iov[seg].iov_base; dio->size += bytes = iov[seg].iov_len; @@ -1076,7 +1076,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, spin_unlock_irqrestore(&dio->bio_lock, flags); if (ret2 == 0) { - ret = dio_complete(dio, offset, ret); + ret = dio_complete(dio, args->offset, ret); kfree(dio); } else BUG_ON(ret != -EIOCBQUEUED); @@ -1106,10 +1106,9 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, * Additional i_alloc_sem locking requirements described inline below. */ ssize_t -__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, - struct block_device *bdev, const struct iovec *iov, loff_t offset, - unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, - int dio_lock_type) +__blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, + struct block_device *bdev, struct dio_args *args, get_block_t get_block, + dio_iodone_t end_io, int dio_lock_type) { int seg; size_t size; @@ -1118,10 +1117,11 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, unsigned bdev_blkbits = 0; unsigned blocksize_mask = (1 << blkbits) - 1; ssize_t retval = -EINVAL; - loff_t end = offset; + loff_t end = args->offset; struct dio *dio; int release_i_mutex = 0; int acquire_i_mutex = 0; + int rw = args->rw; if (rw & WRITE) rw = WRITE_ODIRECT; @@ -1129,18 +1129,18 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, if (bdev) bdev_blkbits = blksize_bits(bdev_logical_block_size(bdev)); - if (offset & blocksize_mask) { + if (args->offset & blocksize_mask) { if (bdev) blkbits = bdev_blkbits; blocksize_mask = (1 << blkbits) - 1; - if (offset & blocksize_mask) + if (args->offset & blocksize_mask) goto out; } /* Check the memory alignment. 
Blocks cannot straddle pages */ - for (seg = 0; seg < nr_segs; seg++) { - addr = (unsigned long)iov[seg].iov_base; - size = iov[seg].iov_len; + for (seg = 0; seg < args->nr_segs; seg++) { + addr = (unsigned long) args->iov[seg].iov_base; + size = args->iov[seg].iov_len; end += size; if ((addr & blocksize_mask) || (size & blocksize_mask)) { if (bdev) @@ -1168,7 +1168,7 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, dio->lock_type = dio_lock_type; if (dio_lock_type != DIO_NO_LOCKING) { /* watch out for a 0 len io from a tricksy fs */ - if (rw == READ && end > offset) { + if (rw == READ && end > args->offset) { struct address_space *mapping; mapping = iocb->ki_filp->f_mapping; @@ -1177,8 +1177,8 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, release_i_mutex = 1; } - retval = filemap_write_and_wait_range(mapping, offset, - end - 1); + retval = filemap_write_and_wait_range(mapping, + args->offset, end - 1); if (retval) { kfree(dio); goto out; @@ -1204,8 +1204,8 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) && (end > i_size_read(inode))); - retval = direct_io_worker(rw, iocb, inode, iov, offset, - nr_segs, blkbits, get_block, end_io, dio); + retval = direct_io_worker(iocb, inode, args, blkbits, get_block, end_io, + dio); /* * In case of error extending write may have instantiated a few @@ -1231,3 +1231,21 @@ out: return retval; } EXPORT_SYMBOL(__blockdev_direct_IO); + +ssize_t generic_file_direct_IO(int rw, struct address_space *mapping, + struct kiocb *iocb, const struct iovec *iov, + loff_t offset, unsigned long nr_segs) +{ + struct dio_args args = { + .rw = rw, + .iov = iov, + .length = iov_length(iov, nr_segs), + .offset = offset, + .nr_segs = nr_segs, + }; + + if (mapping->a_ops->direct_IO) + return mapping->a_ops->direct_IO(iocb, &args); + + return -EINVAL; +} diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index e271303..e813df7 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -790,15 +790,13 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block) return generic_block_bmap(mapping,block,ext2_get_block); } -static ssize_t -ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, - loff_t offset, unsigned long nr_segs) +static ssize_t ext2_direct_IO(struct kiocb *iocb, struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; - return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, ext2_get_block, NULL); + return blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, args, + ext2_get_block, NULL); } static int diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index b49908a..11dc0d1 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1713,9 +1713,7 @@ static int ext3_releasepage(struct page *page, gfp_t wait) * crashes then stale disk data _may_ be exposed inside the file. But current * VFS code falls back into buffered path in that case so we are safe. 
*/ -static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb, - const struct iovec *iov, loff_t offset, - unsigned long nr_segs) +static ssize_t ext3_direct_IO(struct kiocb *iocb, struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; @@ -1723,10 +1721,10 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb, handle_t *handle; ssize_t ret; int orphan = 0; - size_t count = iov_length(iov, nr_segs); + size_t count = args->length; - if (rw == WRITE) { - loff_t final_size = offset + count; + if (args->rw == WRITE) { + loff_t final_size = args->offset + count; if (final_size > inode->i_size) { /* Credits for sb + inode write */ @@ -1746,8 +1744,7 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb, } } - ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, + ret = blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, args, ext3_get_block, NULL); if (orphan) { @@ -1765,7 +1762,7 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb, if (inode->i_nlink) ext3_orphan_del(handle, inode); if (ret > 0) { - loff_t end = offset + ret; + loff_t end = args->offset + ret; if (end > inode->i_size) { ei->i_disksize = end; i_size_write(inode, end); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index f9c642b..164fdb3 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3267,9 +3267,7 @@ static int ext4_releasepage(struct page *page, gfp_t wait) * crashes then stale disk data _may_ be exposed inside the file. But current * VFS code falls back into buffered path in that case so we are safe. */ -static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb, - const struct iovec *iov, loff_t offset, - unsigned long nr_segs) +static ssize_t ext4_direct_IO(struct kiocb *iocb, struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; @@ -3277,10 +3275,10 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb, handle_t *handle; ssize_t ret; int orphan = 0; - size_t count = iov_length(iov, nr_segs); + size_t count = args->length; - if (rw == WRITE) { - loff_t final_size = offset + count; + if (args->rw == WRITE) { + loff_t final_size = args->offset + count; if (final_size > inode->i_size) { /* Credits for sb + inode write */ @@ -3300,8 +3298,7 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb, } } - ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, + ret = blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, args, ext4_get_block, NULL); if (orphan) { @@ -3319,7 +3316,7 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb, if (inode->i_nlink) ext4_orphan_del(handle, inode); if (ret > 0) { - loff_t end = offset + ret; + loff_t end = args->offset + ret; if (end > inode->i_size) { ei->i_disksize = end; i_size_write(inode, end); diff --git a/fs/fat/inode.c b/fs/fat/inode.c index 8970d8c..9d41851 100644 --- a/fs/fat/inode.c +++ b/fs/fat/inode.c @@ -167,14 +167,12 @@ static int fat_write_end(struct file *file, struct address_space *mapping, return err; } -static ssize_t fat_direct_IO(int rw, struct kiocb *iocb, - const struct iovec *iov, - loff_t offset, unsigned long nr_segs) +static ssize_t fat_direct_IO(struct kiocb *iocb, struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; - if (rw == WRITE) { + if (args->rw == WRITE) { /* * FIXME: blockdev_direct_IO() doesn't use ->write_begin(), * so we need to update the ->mmu_private to block boundary. 
@@ -184,7 +182,7 @@ static ssize_t fat_direct_IO(int rw, struct kiocb *iocb, * * Return 0, and fallback to normal buffered write. */ - loff_t size = offset + iov_length(iov, nr_segs); + loff_t size = args->offset + args->length; if (MSDOS_I(inode)->mmu_private < size) return 0; } @@ -193,8 +191,8 @@ static ssize_t fat_direct_IO(int rw, struct kiocb *iocb, * FAT need to use the DIO_LOCKING for avoiding the race * condition of fat_get_block() and ->truncate(). */ - return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, fat_get_block, NULL); + return blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, args, + fat_get_block, NULL); } static sector_t _fat_bmap(struct address_space *mapping, sector_t block) diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c index 7ebae9a..a9422a2 100644 --- a/fs/gfs2/aops.c +++ b/fs/gfs2/aops.c @@ -1021,9 +1021,7 @@ static int gfs2_ok_for_dio(struct gfs2_inode *ip, int rw, loff_t offset) -static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb, - const struct iovec *iov, loff_t offset, - unsigned long nr_segs) +static ssize_t gfs2_direct_IO(struct kiocb *iocb, struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; @@ -1043,13 +1041,12 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb, rv = gfs2_glock_nq(&gh); if (rv) return rv; - rv = gfs2_ok_for_dio(ip, rw, offset); + rv = gfs2_ok_for_dio(ip, args->rw, args->offset); if (rv != 1) goto out; /* dio not valid, fall back to buffered i/o */ - rv = blockdev_direct_IO_no_locking(rw, iocb, inode, inode->i_sb->s_bdev, - iov, offset, nr_segs, - gfs2_get_block_direct, NULL); + rv = blockdev_direct_IO_no_locking(iocb, inode, inode->i_sb->s_bdev, + args, gfs2_get_block_direct, NULL); out: gfs2_glock_dq_m(1, &gh); gfs2_holder_uninit(&gh); diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c index a1cbff2..2998914 100644 --- a/fs/hfs/inode.c +++ b/fs/hfs/inode.c @@ -107,14 +107,13 @@ static int hfs_releasepage(struct page *page, gfp_t mask) return res ? try_to_free_buffers(page) : 0; } -static ssize_t hfs_direct_IO(int rw, struct kiocb *iocb, - const struct iovec *iov, loff_t offset, unsigned long nr_segs) +static ssize_t hfs_direct_IO(struct kiocb *iocb, struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_path.dentry->d_inode->i_mapping->host; - return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, hfs_get_block, NULL); + return blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, args, + hfs_get_block, NULL); } static int hfs_writepages(struct address_space *mapping, diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c index 1bcf597..dd7102b 100644 --- a/fs/hfsplus/inode.c +++ b/fs/hfsplus/inode.c @@ -100,14 +100,14 @@ static int hfsplus_releasepage(struct page *page, gfp_t mask) return res ? 
try_to_free_buffers(page) : 0; } -static ssize_t hfsplus_direct_IO(int rw, struct kiocb *iocb, - const struct iovec *iov, loff_t offset, unsigned long nr_segs) +static ssize_t hfsplus_direct_IO(struct kiocb *iocb, + struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_path.dentry->d_inode->i_mapping->host; - return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, hfsplus_get_block, NULL); + return blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, args, + hfsplus_get_block, NULL); } static int hfsplus_writepages(struct address_space *mapping, diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c index b2ae190..e1420de 100644 --- a/fs/jfs/inode.c +++ b/fs/jfs/inode.c @@ -306,14 +306,13 @@ static sector_t jfs_bmap(struct address_space *mapping, sector_t block) return generic_block_bmap(mapping, block, jfs_get_block); } -static ssize_t jfs_direct_IO(int rw, struct kiocb *iocb, - const struct iovec *iov, loff_t offset, unsigned long nr_segs) +static ssize_t jfs_direct_IO(struct kiocb *iocb, struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; - return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, jfs_get_block, NULL); + return blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, args, + jfs_get_block, NULL); } const struct address_space_operations jfs_aops = { diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index e4e089a..45d931b 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -103,21 +103,18 @@ static inline int put_dreq(struct nfs_direct_req *dreq) /** * nfs_direct_IO - NFS address space operation for direct I/O * @rw: direction (read or write) - * @iocb: target I/O control block - * @iov: array of vectors that define I/O buffer - * @pos: offset in file to begin the operation - * @nr_segs: size of iovec array + * @args: IO arguments * * The presence of this routine in the address space ops vector means * the NFS client supports direct I/O. However, we shunt off direct * read and write requests before the VFS gets them, so this method * should never be called. 
*/ -ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t pos, unsigned long nr_segs) +ssize_t nfs_direct_IO(struct kiocb *iocb, struct dio_args *args) { dprintk("NFS: nfs_direct_IO (%s) off/no(%Ld/%lu) EINVAL\n", iocb->ki_filp->f_path.dentry->d_name.name, - (long long) pos, nr_segs); + (long long) args->offset, args->nr_segs); return -EINVAL; } diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c index fe9d8f2..840c307 100644 --- a/fs/nilfs2/inode.c +++ b/fs/nilfs2/inode.c @@ -222,19 +222,18 @@ static int nilfs_write_end(struct file *file, struct address_space *mapping, } static ssize_t -nilfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, - loff_t offset, unsigned long nr_segs) +nilfs_direct_IO(struct kiocb *iocb, struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; ssize_t size; - if (rw == WRITE) + if (args->rw == WRITE) return 0; /* Needs synchronization with the cleaner */ - size = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, nilfs_get_block, NULL); + size = blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, args, + nilfs_get_block, NULL); return size; } diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index b401654..56e61ba 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -668,11 +668,7 @@ static int ocfs2_releasepage(struct page *page, gfp_t wait) return jbd2_journal_try_to_free_buffers(journal, page, wait); } -static ssize_t ocfs2_direct_IO(int rw, - struct kiocb *iocb, - const struct iovec *iov, - loff_t offset, - unsigned long nr_segs) +static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_path.dentry->d_inode->i_mapping->host; @@ -687,9 +683,8 @@ static ssize_t ocfs2_direct_IO(int rw, if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) return 0; - ret = blockdev_direct_IO_no_locking(rw, iocb, inode, - inode->i_sb->s_bdev, iov, offset, - nr_segs, + ret = blockdev_direct_IO_no_locking(iocb, inode, + inode->i_sb->s_bdev, args, ocfs2_direct_IO_get_blocks, ocfs2_dio_end_io); diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c index a14d6cd..201e6ca 100644 --- a/fs/reiserfs/inode.c +++ b/fs/reiserfs/inode.c @@ -3025,15 +3025,12 @@ static int reiserfs_releasepage(struct page *page, gfp_t unused_gfp_flags) /* We thank Mingming Cao for helping us understand in great detail what to do in this section of the code. 
*/ -static ssize_t reiserfs_direct_IO(int rw, struct kiocb *iocb, - const struct iovec *iov, loff_t offset, - unsigned long nr_segs) +static ssize_t reiserfs_direct_IO(struct kiocb *iocb, struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; - return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, + return blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, args, reiserfs_get_blocks_direct_io, NULL); } diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index aecf251..0faf1fe 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -1532,11 +1532,8 @@ xfs_end_io_direct( STATIC ssize_t xfs_vm_direct_IO( - int rw, struct kiocb *iocb, - const struct iovec *iov, - loff_t offset, - unsigned long nr_segs) + struct dio_args *args) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; @@ -1545,18 +1542,14 @@ xfs_vm_direct_IO( bdev = xfs_find_bdev_for_inode(XFS_I(inode)); - if (rw == WRITE) { + if (args->rw == WRITE) { iocb->private = xfs_alloc_ioend(inode, IOMAP_UNWRITTEN); - ret = blockdev_direct_IO_own_locking(rw, iocb, inode, - bdev, iov, offset, nr_segs, - xfs_get_blocks_direct, - xfs_end_io_direct); + ret = blockdev_direct_IO_own_locking(iocb, inode, bdev, args, + xfs_get_blocks_direct, xfs_end_io_direct); } else { iocb->private = xfs_alloc_ioend(inode, IOMAP_READ); - ret = blockdev_direct_IO_no_locking(rw, iocb, inode, - bdev, iov, offset, nr_segs, - xfs_get_blocks_direct, - xfs_end_io_direct); + ret = blockdev_direct_IO_no_locking(iocb, inode, + bdev, args, xfs_get_blocks_direct, xfs_end_io_direct); } if (unlikely(ret != -EIOCBQUEUED && iocb->private)) diff --git a/include/linux/fs.h b/include/linux/fs.h index 67888a9..5971116 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -560,6 +560,7 @@ typedef struct { typedef int (*read_actor_t)(read_descriptor_t *, struct page *, unsigned long, unsigned long); +struct dio_args; struct address_space_operations { int (*writepage)(struct page *page, struct writeback_control *wbc); int (*readpage)(struct file *, struct page *); @@ -585,8 +586,7 @@ struct address_space_operations { sector_t (*bmap)(struct address_space *, sector_t); void (*invalidatepage) (struct page *, unsigned long); int (*releasepage) (struct page *, gfp_t); - ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, - loff_t offset, unsigned long nr_segs); + ssize_t (*direct_IO)(struct kiocb *, struct dio_args *); int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **, unsigned long *); /* migrate the contents of a page to the specified target */ @@ -2241,10 +2241,24 @@ static inline int xip_truncate_page(struct address_space *mapping, loff_t from) #endif #ifdef CONFIG_BLOCK -ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, - struct block_device *bdev, const struct iovec *iov, loff_t offset, - unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, - int lock_type); + +/* + * Arguments passed to aops->direct_IO() + */ +struct dio_args { + int rw; + const struct iovec *iov; + unsigned long length; + loff_t offset; + unsigned long nr_segs; +}; + +ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, + struct block_device *bdev, struct dio_args *args, get_block_t get_block, + dio_iodone_t end_io, int lock_type); + +ssize_t generic_file_direct_IO(int, struct address_space *, struct kiocb *, + const struct iovec *, loff_t, unsigned long); enum {
DIO_LOCKING = 1, /* need locking between buffered and direct access */ @@ -2252,31 +2266,28 @@ enum { DIO_OWN_LOCKING, /* filesystem locks buffered and direct internally */ }; -static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb, - struct inode *inode, struct block_device *bdev, const struct iovec *iov, - loff_t offset, unsigned long nr_segs, get_block_t get_block, - dio_iodone_t end_io) +static inline ssize_t blockdev_direct_IO(struct kiocb *iocb, + struct inode *inode, struct block_device *bdev, struct dio_args *args, + get_block_t get_block, dio_iodone_t end_io) { - return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, - nr_segs, get_block, end_io, DIO_LOCKING); + return __blockdev_direct_IO(iocb, inode, bdev, args, + get_block, end_io, DIO_LOCKING); } -static inline ssize_t blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb, - struct inode *inode, struct block_device *bdev, const struct iovec *iov, - loff_t offset, unsigned long nr_segs, get_block_t get_block, - dio_iodone_t end_io) +static inline ssize_t blockdev_direct_IO_no_locking(struct kiocb *iocb, + struct inode *inode, struct block_device *bdev, struct dio_args *args, + get_block_t get_block, dio_iodone_t end_io) { - return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, - nr_segs, get_block, end_io, DIO_NO_LOCKING); + return __blockdev_direct_IO(iocb, inode, bdev, args, + get_block, end_io, DIO_NO_LOCKING); } -static inline ssize_t blockdev_direct_IO_own_locking(int rw, struct kiocb *iocb, - struct inode *inode, struct block_device *bdev, const struct iovec *iov, - loff_t offset, unsigned long nr_segs, get_block_t get_block, - dio_iodone_t end_io) +static inline ssize_t blockdev_direct_IO_own_locking(struct kiocb *iocb, + struct inode *inode, struct block_device *bdev, struct dio_args *args, + get_block_t get_block, dio_iodone_t end_io) { - return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, - nr_segs, get_block, end_io, DIO_OWN_LOCKING); + return __blockdev_direct_IO(iocb, inode, bdev, args, + get_block, end_io, DIO_OWN_LOCKING); } #endif diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h index f6b9024..97a2383 100644 --- a/include/linux/nfs_fs.h +++ b/include/linux/nfs_fs.h @@ -408,8 +408,7 @@ extern int nfs3_removexattr (struct dentry *, const char *name); /* * linux/fs/nfs/direct.c */ -extern ssize_t nfs_direct_IO(int, struct kiocb *, const struct iovec *, loff_t, - unsigned long); +extern ssize_t nfs_direct_IO(struct kiocb *, struct dio_args *); extern ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos); diff --git a/mm/filemap.c b/mm/filemap.c index ccea3b6..cf85298 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1345,8 +1345,9 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, retval = filemap_write_and_wait_range(mapping, pos, pos + iov_length(iov, nr_segs) - 1); if (!retval) { - retval = mapping->a_ops->direct_IO(READ, iocb, - iov, pos, nr_segs); + retval = generic_file_direct_IO(READ, mapping, + iocb, iov, + pos, nr_segs); } if (retval > 0) *ppos = pos + retval; @@ -2144,7 +2145,8 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov, } } - written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs); + written = generic_file_direct_IO(WRITE, mapping, iocb, iov, pos, + *nr_segs); /* * Finally, try again to invalidate clean pages which might have been -- 1.6.4.53.g3f55e ^ permalink raw reply related [flat|nested] 30+ messages in thread
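To make the new calling convention concrete, here is a minimal sketch of what a converted filesystem looks like with this patch applied. The filesystem name and its get_block helper are invented for illustration; struct dio_args, the ->direct_IO() prototype and blockdev_direct_IO() are the interfaces introduced above.

/*
 * Hypothetical "examplefs" conversion sketch -- examplefs_get_block()
 * stands in for the filesystem's real get_block_t callback.
 */
static ssize_t examplefs_direct_IO(struct kiocb *iocb, struct dio_args *args)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;

	/* rw, iov, offset and nr_segs now all travel inside *args */
	return blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, args,
				  examplefs_get_block, NULL);
}

static const struct address_space_operations examplefs_aops = {
	.direct_IO	= examplefs_direct_IO,
	/* remaining address_space operations elided */
};

Since every wrapper and every ->direct_IO() instance now has the same two-argument shape, growing or shrinking struct dio_args no longer requires touching each filesystem's prototype.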
* [PATCH 2/4] direct-io: make O_DIRECT IO path be page based 2009-08-20 10:17 ` [PATCH 1/6] workqueue: replace singlethread/freezable/rt parameters and variables with flags Jens Axboe @ 2009-08-20 10:17 ` Jens Axboe 2009-08-20 10:17 ` [PATCH 2/6] workqueue: add support for lazy workqueues Jens Axboe 0 siblings, 1 reply; 30+ messages in thread From: Jens Axboe @ 2009-08-20 10:17 UTC (permalink / raw) To: linux-kernel; +Cc: jeff, benh, htejun, bzolnier, alan, Jens Axboe Currently we pass in the iovec array and let the O_DIRECT core handle the get_user_pages() business. This works, but it means that we can only ever use user pages for O_DIRECT. Switch the aops->direct_IO() and below code to use page arrays instead, so that it doesn't make any assumptions about who the pages belong to. This works directly for all users but NFS, which just uses the same helper that the generic mapping read/write functions also call. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> --- fs/direct-io.c | 304 ++++++++++++++++++++---------------------- fs/nfs/direct.c | 161 +++++++++---------------- fs/nfs/file.c | 8 +- include/linux/fs.h | 15 ++- include/linux/nfs_fs.h | 7 +- mm/filemap.c | 6 +- 6 files changed, 206 insertions(+), 295 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index 181848c..22a945b 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -38,12 +38,6 @@ #include <asm/atomic.h> /* - * How many user pages to map in one call to get_user_pages(). This determines - * the size of a structure on the stack. - */ -#define DIO_PAGES 64 - -/* * This code generally works in units of "dio_blocks". A dio_block is * somewhere between the hard sector size and the filesystem block size. it * is determined on a per-invocation basis. When talking to the filesystem @@ -105,20 +99,13 @@ struct dio { sector_t cur_page_block; /* Where it starts */ /* - * Page fetching state. These variables belong to dio_refill_pages(). - */ - int curr_page; /* changes */ - int total_pages; /* doesn't change */ - unsigned long curr_user_address;/* changes */ - - /* * Page queue. These variables belong to dio_refill_pages() and * dio_get_page(). */ - struct page *pages[DIO_PAGES]; /* page buffer */ - unsigned head; /* next page to process */ - unsigned tail; /* last valid page + 1 */ - int page_errors; /* errno from get_user_pages() */ + struct page **pages; /* page buffer */ + unsigned int head_page; /* next page to process */ + unsigned int total_pages; /* last valid page + 1 */ + unsigned int first_page_off; /* offset into first page in map */ /* BIO completion state */ spinlock_t bio_lock; /* protects BIO fields below */ @@ -134,57 +121,6 @@ struct dio { }; /* - * How many pages are in the queue? - */ -static inline unsigned dio_pages_present(struct dio *dio) -{ - return dio->tail - dio->head; -} - -/* - * Go grab and pin some userspace pages. Typically we'll get 64 at a time. - */ -static int dio_refill_pages(struct dio *dio) -{ - int ret; - int nr_pages; - - nr_pages = min(dio->total_pages - dio->curr_page, DIO_PAGES); - ret = get_user_pages_fast( - dio->curr_user_address, /* Where from? */ - nr_pages, /* How many pages? */ - dio->rw == READ, /* Write to memory? */ - &dio->pages[0]); /* Put results here */ - - if (ret < 0 && dio->blocks_available && (dio->rw & WRITE)) { - struct page *page = ZERO_PAGE(0); - /* - * A memory fault, but the filesystem has some outstanding - * mapped blocks. We need to use those blocks up to avoid - * leaking stale data in the file.
- */ - if (dio->page_errors == 0) - dio->page_errors = ret; - page_cache_get(page); - dio->pages[0] = page; - dio->head = 0; - dio->tail = 1; - ret = 0; - goto out; - } - - if (ret >= 0) { - dio->curr_user_address += ret * PAGE_SIZE; - dio->curr_page += ret; - dio->head = 0; - dio->tail = ret; - ret = 0; - } -out: - return ret; -} - -/* * Get another userspace page. Returns an ERR_PTR on error. Pages are * buffered inside the dio so that we can call get_user_pages() against a * decent number of pages, less frequently. To provide nicer use of the @@ -192,15 +128,10 @@ out: */ static struct page *dio_get_page(struct dio *dio) { - if (dio_pages_present(dio) == 0) { - int ret; + if (dio->head_page < dio->total_pages) + return dio->pages[dio->head_page++]; - ret = dio_refill_pages(dio); - if (ret) - return ERR_PTR(ret); - BUG_ON(dio_pages_present(dio) == 0); - } - return dio->pages[dio->head++]; + return NULL; } /** @@ -245,8 +176,6 @@ static int dio_complete(struct dio *dio, loff_t offset, int ret) up_read_non_owner(&dio->inode->i_alloc_sem); if (ret == 0) - ret = dio->page_errors; - if (ret == 0) ret = dio->io_error; if (ret == 0) ret = transferred; @@ -351,8 +280,10 @@ static void dio_bio_submit(struct dio *dio) */ static void dio_cleanup(struct dio *dio) { - while (dio_pages_present(dio)) - page_cache_release(dio_get_page(dio)); + struct page *page; + + while ((page = dio_get_page(dio)) != NULL) + page_cache_release(page); } /* @@ -490,7 +421,6 @@ static int dio_bio_reap(struct dio *dio) */ static int get_more_blocks(struct dio *dio) { - int ret; struct buffer_head *map_bh = &dio->map_bh; sector_t fs_startblk; /* Into file, in filesystem-sized blocks */ unsigned long fs_count; /* Number of filesystem-sized blocks */ @@ -502,38 +432,33 @@ static int get_more_blocks(struct dio *dio) * If there was a memory error and we've overwritten all the * mapped blocks then we can now return that memory error */ - ret = dio->page_errors; - if (ret == 0) { - BUG_ON(dio->block_in_file >= dio->final_block_in_request); - fs_startblk = dio->block_in_file >> dio->blkfactor; - dio_count = dio->final_block_in_request - dio->block_in_file; - fs_count = dio_count >> dio->blkfactor; - blkmask = (1 << dio->blkfactor) - 1; - if (dio_count & blkmask) - fs_count++; - - map_bh->b_state = 0; - map_bh->b_size = fs_count << dio->inode->i_blkbits; - - create = dio->rw & WRITE; - if (dio->lock_type == DIO_LOCKING) { - if (dio->block_in_file < (i_size_read(dio->inode) >> - dio->blkbits)) - create = 0; - } else if (dio->lock_type == DIO_NO_LOCKING) { + BUG_ON(dio->block_in_file >= dio->final_block_in_request); + fs_startblk = dio->block_in_file >> dio->blkfactor; + dio_count = dio->final_block_in_request - dio->block_in_file; + fs_count = dio_count >> dio->blkfactor; + blkmask = (1 << dio->blkfactor) - 1; + if (dio_count & blkmask) + fs_count++; + + map_bh->b_state = 0; + map_bh->b_size = fs_count << dio->inode->i_blkbits; + + create = dio->rw & WRITE; + if (dio->lock_type == DIO_LOCKING) { + if (dio->block_in_file < (i_size_read(dio->inode) >> + dio->blkbits)) create = 0; - } - - /* - * For writes inside i_size we forbid block creations: only - * overwrites are permitted. We fall back to buffered writes - * at a higher level for inside-i_size block-instantiating - * writes. - */ - ret = (*dio->get_block)(dio->inode, fs_startblk, - map_bh, create); + } else if (dio->lock_type == DIO_NO_LOCKING) { + create = 0; } - return ret; + + /* + * For writes inside i_size we forbid block creations: only + * overwrites are permitted. 
We fall back to buffered writes + * at a higher level for inside-i_size block-instantiating + * writes. + */ + return dio->get_block(dio->inode, fs_startblk, map_bh, create); } /* @@ -567,8 +492,8 @@ static int dio_bio_add_page(struct dio *dio) { int ret; - ret = bio_add_page(dio->bio, dio->cur_page, - dio->cur_page_len, dio->cur_page_offset); + ret = bio_add_page(dio->bio, dio->cur_page, dio->cur_page_len, + dio->cur_page_offset); if (ret == dio->cur_page_len) { /* * Decrement count only, if we are done with this page @@ -804,6 +729,9 @@ static int do_direct_IO(struct dio *dio) unsigned this_chunk_blocks; /* # of blocks */ unsigned u; + offset_in_page += dio->first_page_off; + dio->first_page_off = 0; + if (dio->blocks_available == 0) { /* * Need to go and map some more disk @@ -933,13 +861,10 @@ direct_io_worker(struct kiocb *iocb, struct inode *inode, struct dio_args *args, unsigned blkbits, get_block_t get_block, dio_iodone_t end_io, struct dio *dio) { - const struct iovec *iov = args->iov; - unsigned long user_addr; unsigned long flags; - int seg, rw = args->rw; + int rw = args->rw; ssize_t ret = 0; ssize_t ret2; - size_t bytes; dio->inode = inode; dio->rw = rw; @@ -965,46 +890,25 @@ direct_io_worker(struct kiocb *iocb, struct inode *inode, if (unlikely(dio->blkfactor)) dio->pages_in_io = 2; - for (seg = 0; seg < args->nr_segs; seg++) { - user_addr = (unsigned long) iov[seg].iov_base; - dio->pages_in_io += - ((user_addr+iov[seg].iov_len +PAGE_SIZE-1)/PAGE_SIZE - - user_addr/PAGE_SIZE); - } + dio->pages_in_io += args->nr_segs; + dio->size = args->length; + if (args->user_addr) { + dio->first_page_off = args->user_addr & ~PAGE_MASK; + dio->first_block_in_page = dio->first_page_off >> blkbits; + if (dio->first_block_in_page) + dio->first_page_off -= 1 << blkbits; + } else + dio->first_page_off = args->first_page_off; - for (seg = 0; seg < args->nr_segs; seg++) { - user_addr = (unsigned long)iov[seg].iov_base; - dio->size += bytes = iov[seg].iov_len; - - /* Index into the first page of the first block */ - dio->first_block_in_page = (user_addr & ~PAGE_MASK) >> blkbits; - dio->final_block_in_request = dio->block_in_file + - (bytes >> blkbits); - /* Page fetching state */ - dio->head = 0; - dio->tail = 0; - dio->curr_page = 0; - - dio->total_pages = 0; - if (user_addr & (PAGE_SIZE-1)) { - dio->total_pages++; - bytes -= PAGE_SIZE - (user_addr & (PAGE_SIZE - 1)); - } - dio->total_pages += (bytes + PAGE_SIZE - 1) / PAGE_SIZE; - dio->curr_user_address = user_addr; - - ret = do_direct_IO(dio); + dio->final_block_in_request = dio->block_in_file + (dio->size >> blkbits); + dio->head_page = 0; + dio->total_pages = args->nr_segs; - dio->result += iov[seg].iov_len - + ret = do_direct_IO(dio); + + dio->result += args->length - ((dio->final_block_in_request - dio->block_in_file) << blkbits); - - if (ret) { - dio_cleanup(dio); - break; - } - } /* end iovec loop */ - if (ret == -ENOTBLK && (rw & WRITE)) { /* * The remaining part of the request will be @@ -1110,9 +1014,6 @@ __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, struct block_device *bdev, struct dio_args *args, get_block_t get_block, dio_iodone_t end_io, int dio_lock_type) { - int seg; - size_t size; - unsigned long addr; unsigned blkbits = inode->i_blkbits; unsigned bdev_blkbits = 0; unsigned blocksize_mask = (1 << blkbits) - 1; @@ -1138,17 +1039,14 @@ __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, } /* Check the memory alignment. 
Blocks cannot straddle pages */ - for (seg = 0; seg < args->nr_segs; seg++) { - addr = (unsigned long) args->iov[seg].iov_base; - size = args->iov[seg].iov_len; - end += size; - if ((addr & blocksize_mask) || (size & blocksize_mask)) { - if (bdev) - blkbits = bdev_blkbits; - blocksize_mask = (1 << blkbits) - 1; - if ((addr & blocksize_mask) || (size & blocksize_mask)) - goto out; - } + if ((args->user_addr & blocksize_mask) || + (args->length & blocksize_mask)) { + if (bdev) + blkbits = bdev_blkbits; + blocksize_mask = (1 << blkbits) - 1; + if ((args->user_addr & blocksize_mask) || + (args->length & blocksize_mask)) + goto out; } dio = kzalloc(sizeof(*dio), GFP_KERNEL); @@ -1156,6 +1054,8 @@ __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, if (!dio) goto out; + dio->pages = args->pages; + /* * For block device access DIO_NO_LOCKING is used, * neither readers nor writers do any locking at all @@ -1232,20 +1132,70 @@ out: } EXPORT_SYMBOL(__blockdev_direct_IO); -ssize_t generic_file_direct_IO(int rw, struct address_space *mapping, - struct kiocb *iocb, const struct iovec *iov, - loff_t offset, unsigned long nr_segs) +static ssize_t __generic_file_direct_IO(int rw, struct address_space *mapping, + struct kiocb *iocb, + const struct iovec *iov, loff_t offset, + dio_io_actor *actor) { + struct page *stack_pages[UIO_FASTIOV]; + unsigned long nr_pages, start, end; struct dio_args args = { - .rw = rw, - .iov = iov, - .length = iov_length(iov, nr_segs), + .pages = stack_pages, + .length = iov->iov_len, + .user_addr = (unsigned long) iov->iov_base, .offset = offset, - .nr_segs = nr_segs, }; + ssize_t ret; + + end = (args.user_addr + iov->iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = args.user_addr >> PAGE_SHIFT; + nr_pages = end - start; + + if (nr_pages >= UIO_FASTIOV) { + args.pages = kzalloc(nr_pages * sizeof(struct page *), + GFP_KERNEL); + if (!args.pages) + return -ENOMEM; + } + + ret = get_user_pages_fast(args.user_addr, nr_pages, rw == READ, + args.pages); + if (ret > 0) { + args.nr_segs = ret; + ret = actor(iocb, &args); + } - if (mapping->a_ops->direct_IO) - return mapping->a_ops->direct_IO(iocb, &args); + if (args.pages != stack_pages) + kfree(args.pages); - return -EINVAL; + return ret; +} + +/* + * Transform the iov into a page based structure for passing into the lower + * parts of O_DIRECT handling + */ +ssize_t generic_file_direct_IO(int rw, struct address_space *mapping, + struct kiocb *kiocb, const struct iovec *iov, + loff_t offset, unsigned long nr_segs, + dio_io_actor *actor) +{ + ssize_t ret = 0, ret2; + unsigned long i; + + for (i = 0; i < nr_segs; i++) { + ret2 = __generic_file_direct_IO(rw, mapping, kiocb, iov, offset, + actor); + if (ret2 < 0) { + if (!ret) + ret = ret2; + break; + } + iov++; + offset += ret2; + ret += ret2; + } + + return ret; } +EXPORT_SYMBOL_GPL(generic_file_direct_IO); diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index 45d931b..d9da548 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -271,13 +271,12 @@ static const struct rpc_call_ops nfs_read_direct_ops = { * no requests have been sent, just return an error. 
*/ static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq, - const struct iovec *iov, - loff_t pos) + struct dio_args *args) { struct nfs_open_context *ctx = dreq->ctx; struct inode *inode = ctx->path.dentry->d_inode; - unsigned long user_addr = (unsigned long)iov->iov_base; - size_t count = iov->iov_len; + unsigned long user_addr = args->user_addr; + size_t count = args->length; size_t rsize = NFS_SERVER(inode)->rsize; struct rpc_task *task; struct rpc_message msg = { @@ -306,24 +305,8 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq, if (unlikely(!data)) break; - down_read(&current->mm->mmap_sem); - result = get_user_pages(current, current->mm, user_addr, - data->npages, 1, 0, data->pagevec, NULL); - up_read(&current->mm->mmap_sem); - if (result < 0) { - nfs_readdata_free(data); - break; - } - if ((unsigned)result < data->npages) { - bytes = result * PAGE_SIZE; - if (bytes <= pgbase) { - nfs_direct_release_pages(data->pagevec, result); - nfs_readdata_free(data); - break; - } - bytes -= pgbase; - data->npages = result; - } + data->pagevec = args->pages; + data->npages = args->nr_segs; get_dreq(dreq); @@ -332,7 +315,7 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq, data->cred = msg.rpc_cred; data->args.fh = NFS_FH(inode); data->args.context = ctx; - data->args.offset = pos; + data->args.offset = args->offset; data->args.pgbase = pgbase; data->args.pages = data->pagevec; data->args.count = bytes; @@ -361,7 +344,7 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq, started += bytes; user_addr += bytes; - pos += bytes; + args->offset += bytes; /* FIXME: Remove this unnecessary math from final patch */ pgbase += bytes; pgbase &= ~PAGE_MASK; @@ -376,26 +359,19 @@ static ssize_t nfs_direct_read_schedule_segment(struct nfs_direct_req *dreq, } static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq, - const struct iovec *iov, - unsigned long nr_segs, - loff_t pos) + struct dio_args *args) { ssize_t result = -EINVAL; size_t requested_bytes = 0; - unsigned long seg; get_dreq(dreq); - for (seg = 0; seg < nr_segs; seg++) { - const struct iovec *vec = &iov[seg]; - result = nfs_direct_read_schedule_segment(dreq, vec, pos); - if (result < 0) - break; - requested_bytes += result; - if ((size_t)result < vec->iov_len) - break; - pos += vec->iov_len; - } + result = nfs_direct_read_schedule_segment(dreq, args); + if (result < 0) + goto out; + + requested_bytes += result; + args->offset += result; if (put_dreq(dreq)) nfs_direct_complete(dreq); @@ -403,13 +379,13 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq, if (requested_bytes != 0) return 0; +out: if (result < 0) return result; return -EIO; } -static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov, - unsigned long nr_segs, loff_t pos) +static ssize_t nfs_direct_read(struct kiocb *iocb, struct dio_args *args) { ssize_t result = 0; struct inode *inode = iocb->ki_filp->f_mapping->host; @@ -424,7 +400,7 @@ static ssize_t nfs_direct_read(struct kiocb *iocb, const struct iovec *iov, if (!is_sync_kiocb(iocb)) dreq->iocb = iocb; - result = nfs_direct_read_schedule_iovec(dreq, iov, nr_segs, pos); + result = nfs_direct_read_schedule_iovec(dreq, args); if (!result) result = nfs_direct_wait(dreq); nfs_direct_req_release(dreq); @@ -691,13 +667,13 @@ static const struct rpc_call_ops nfs_write_direct_ops = { * no requests have been sent, just return an error.
*/ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq, - const struct iovec *iov, - loff_t pos, int sync) + struct dio_args *args, + int sync) { struct nfs_open_context *ctx = dreq->ctx; struct inode *inode = ctx->path.dentry->d_inode; - unsigned long user_addr = (unsigned long)iov->iov_base; - size_t count = iov->iov_len; + unsigned long user_addr = args->user_addr; + size_t count = args->length; struct rpc_task *task; struct rpc_message msg = { .rpc_cred = ctx->cred, @@ -726,24 +702,8 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq, if (unlikely(!data)) break; - down_read(&current->mm->mmap_sem); - result = get_user_pages(current, current->mm, user_addr, - data->npages, 0, 0, data->pagevec, NULL); - up_read(&current->mm->mmap_sem); - if (result < 0) { - nfs_writedata_free(data); - break; - } - if ((unsigned)result < data->npages) { - bytes = result * PAGE_SIZE; - if (bytes <= pgbase) { - nfs_direct_release_pages(data->pagevec, result); - nfs_writedata_free(data); - break; - } - bytes -= pgbase; - data->npages = result; - } + data->pagevec = args->pages; + data->npages = args->nr_segs; get_dreq(dreq); @@ -754,7 +714,7 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq, data->inode = inode; data->cred = msg.rpc_cred; data->args.fh = NFS_FH(inode); data->args.context = ctx; - data->args.offset = pos; + data->args.offset = args->offset; data->args.pgbase = pgbase; data->args.pages = data->pagevec; data->args.count = bytes; @@ -784,7 +744,7 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq, started += bytes; user_addr += bytes; - pos += bytes; + args->offset += bytes; /* FIXME: Remove this useless math from the final patch */ pgbase += bytes; @@ -800,27 +760,19 @@ static ssize_t nfs_direct_write_schedule_segment(struct nfs_direct_req *dreq, } static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq, - const struct iovec *iov, - unsigned long nr_segs, - loff_t pos, int sync) + struct dio_args *args, int sync) { ssize_t result = 0; size_t requested_bytes = 0; - unsigned long seg; get_dreq(dreq); - for (seg = 0; seg < nr_segs; seg++) { - const struct iovec *vec = &iov[seg]; - result = nfs_direct_write_schedule_segment(dreq, vec, - pos, sync); - if (result < 0) - break; - requested_bytes += result; - if ((size_t)result < vec->iov_len) - break; - pos += vec->iov_len; - } + result = nfs_direct_write_schedule_segment(dreq, args, sync); + if (result < 0) + goto out; + + requested_bytes += result; + args->offset += result; if (put_dreq(dreq)) nfs_direct_write_complete(dreq, dreq->inode); @@ -828,14 +780,13 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq, if (requested_bytes != 0) return 0; +out: if (result < 0) return result; return -EIO; } -static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov, - unsigned long nr_segs, loff_t pos, - size_t count) +static ssize_t nfs_direct_write(struct kiocb *iocb, struct dio_args *args) { ssize_t result = 0; struct inode *inode = iocb->ki_filp->f_mapping->host; @@ -848,7 +799,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov, return -ENOMEM; nfs_alloc_commit_data(dreq); - if (dreq->commit_data == NULL || count < wsize) + if (dreq->commit_data == NULL || args->length < wsize) sync = NFS_FILE_SYNC; dreq->inode = inode; @@ -856,7 +807,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov, if (!is_sync_kiocb(iocb)) dreq->iocb = iocb; - result =
nfs_direct_write_schedule_iovec(dreq, iov, nr_segs, pos, sync); + result = nfs_direct_write_schedule_iovec(dreq, args, sync); if (!result) result = nfs_direct_wait(dreq); nfs_direct_req_release(dreq); @@ -867,9 +818,7 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov, /** * nfs_file_direct_read - file direct read operation for NFS files * @iocb: target I/O control block - * @iov: vector of user buffers into which to read data - * @nr_segs: size of iov vector - * @pos: byte offset in file where reading starts + * @args: direct IO arguments * * We use this function for direct reads instead of calling * generic_file_aio_read() in order to avoid gfar's check to see if @@ -885,21 +834,20 @@ static ssize_t nfs_direct_write(struct kiocb *iocb, const struct iovec *iov, * client must read the updated atime from the server back into its * cache. */ -ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov, - unsigned long nr_segs, loff_t pos) +static ssize_t nfs_file_direct_read(struct kiocb *iocb, struct dio_args *args) { ssize_t retval = -EINVAL; struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; size_t count; - count = iov_length(iov, nr_segs); + count = args->length; nfs_add_stats(mapping->host, NFSIOS_DIRECTREADBYTES, count); dfprintk(FILE, "NFS: direct read(%s/%s, %zd@%Ld)\n", file->f_path.dentry->d_parent->d_name.name, file->f_path.dentry->d_name.name, - count, (long long) pos); + count, (long long) args->offset); retval = 0; if (!count) @@ -909,9 +857,9 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, const struct iovec *iov, if (retval) goto out; - retval = nfs_direct_read(iocb, iov, nr_segs, pos); + retval = nfs_direct_read(iocb, args); if (retval > 0) - iocb->ki_pos = pos + retval; + iocb->ki_pos = args->offset + retval; out: return retval; @@ -920,9 +868,7 @@ out: /** * nfs_file_direct_write - file direct write operation for NFS files * @iocb: target I/O control block - * @iov: vector of user buffers from which to write data - * @nr_segs: size of iov vector - * @pos: byte offset in file where writing starts + * @args: direct IO arguments * * We use this function for direct writes instead of calling * generic_file_aio_write() in order to avoid taking the inode @@ -942,23 +888,22 @@ out: * Note that O_APPEND is not supported for NFS direct writes, as there * is no atomic O_APPEND write facility in the NFS protocol. 
*/ -ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov, - unsigned long nr_segs, loff_t pos) +static ssize_t nfs_file_direct_write(struct kiocb *iocb, struct dio_args *args) { ssize_t retval = -EINVAL; struct file *file = iocb->ki_filp; struct address_space *mapping = file->f_mapping; size_t count; - count = iov_length(iov, nr_segs); + count = args->length; nfs_add_stats(mapping->host, NFSIOS_DIRECTWRITTENBYTES, count); dfprintk(FILE, "NFS: direct write(%s/%s, %zd@%Ld)\n", file->f_path.dentry->d_parent->d_name.name, file->f_path.dentry->d_name.name, - count, (long long) pos); + count, (long long) args->offset); - retval = generic_write_checks(file, &pos, &count, 0); + retval = generic_write_checks(file, &args->offset, &count, 0); if (retval) goto out; @@ -973,15 +918,23 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, const struct iovec *iov, if (retval) goto out; - retval = nfs_direct_write(iocb, iov, nr_segs, pos, count); + retval = nfs_direct_write(iocb, args); if (retval > 0) - iocb->ki_pos = pos + retval; + iocb->ki_pos = args->offset + retval; out: return retval; } +ssize_t nfs_file_direct_io(struct kiocb *kiocb, struct dio_args *args) +{ + if (args->rw == READ) + return nfs_file_direct_read(kiocb, args); + + return nfs_file_direct_write(kiocb, args); +} + /** * nfs_init_directcache - create a slab cache for nfs_direct_req structures * diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 0506232..97d8cc7 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -249,13 +249,15 @@ static ssize_t nfs_file_read(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { + struct address_space *mapping = iocb->ki_filp->f_mapping; struct dentry * dentry = iocb->ki_filp->f_path.dentry; struct inode * inode = dentry->d_inode; ssize_t result; size_t count = iov_length(iov, nr_segs); if (iocb->ki_filp->f_flags & O_DIRECT) - return nfs_file_direct_read(iocb, iov, nr_segs, pos); + return generic_file_direct_IO(READ, mapping, iocb, iov, pos, + nr_segs, nfs_file_direct_io); dprintk("NFS: read(%s/%s, %lu@%lu)\n", dentry->d_parent->d_name.name, dentry->d_name.name, @@ -546,13 +548,15 @@ static int nfs_need_sync_write(struct file *filp, struct inode *inode) static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec *iov, unsigned long nr_segs, loff_t pos) { + struct address_space *mapping = iocb->ki_filp->f_mapping; struct dentry * dentry = iocb->ki_filp->f_path.dentry; struct inode * inode = dentry->d_inode; ssize_t result; size_t count = iov_length(iov, nr_segs); if (iocb->ki_filp->f_flags & O_DIRECT) - return nfs_file_direct_write(iocb, iov, nr_segs, pos); + return generic_file_direct_IO(WRITE, mapping, iocb, iov, pos, + nr_segs, nfs_file_direct_io); dprintk("NFS: write(%s/%s, %lu@%Ld)\n", dentry->d_parent->d_name.name, dentry->d_name.name, diff --git a/include/linux/fs.h b/include/linux/fs.h index 5971116..539994a 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2247,18 +2247,27 @@ static inline int xip_truncate_page(struct address_space *mapping, loff_t from) */ struct dio_args { int rw; - const struct iovec *iov; + struct page **pages; + unsigned int first_page_off; + unsigned long nr_segs; unsigned long length; loff_t offset; - unsigned long nr_segs; + + /* + * Original user pointer, we'll get rid of this + */ + unsigned long user_addr; }; ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, struct block_device *bdev, struct dio_args *args, get_block_t get_block, dio_iodone_t end_io, int lock_type); +typedef ssize_t 
(dio_io_actor)(struct kiocb *, struct dio_args *); + ssize_t generic_file_direct_IO(int, struct address_space *, struct kiocb *, - const struct iovec *, loff_t, unsigned long); + const struct iovec *, loff_t, unsigned long, + dio_io_actor); enum { DIO_LOCKING = 1, /* need locking between buffered and direct access */ diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h index 97a2383..ded8337 100644 --- a/include/linux/nfs_fs.h +++ b/include/linux/nfs_fs.h @@ -409,12 +409,7 @@ extern int nfs3_removexattr (struct dentry *, const char *name); * linux/fs/nfs/direct.c */ extern ssize_t nfs_direct_IO(struct kiocb *, struct dio_args *); -extern ssize_t nfs_file_direct_read(struct kiocb *iocb, - const struct iovec *iov, unsigned long nr_segs, - loff_t pos); -extern ssize_t nfs_file_direct_write(struct kiocb *iocb, - const struct iovec *iov, unsigned long nr_segs, - loff_t pos); +extern ssize_t nfs_file_direct_io(struct kiocb *, struct dio_args *); /* * linux/fs/nfs/dir.c diff --git a/mm/filemap.c b/mm/filemap.c index cf85298..3e03021 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1346,8 +1346,8 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, pos + iov_length(iov, nr_segs) - 1); if (!retval) { retval = generic_file_direct_IO(READ, mapping, - iocb, iov, - pos, nr_segs); + iocb, iov, pos, nr_segs, + mapping->a_ops->direct_IO); } if (retval > 0) *ppos = pos + retval; @@ -2146,7 +2146,7 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov, } written = generic_file_direct_IO(WRITE, mapping, iocb, iov, pos, - *nr_segs); + *nr_segs, mapping->a_ops->direct_IO); /* * Finally, try again to invalidate clean pages which might have been -- 1.6.4.53.g3f55e ^ permalink raw reply related [flat|nested] 30+ messages in thread
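The dio_io_actor indirection added here is what lets NFS, which bypasses the blockdev code entirely, share the page-pinning front end with the block-based filesystems. A sketch of an O_DIRECT read path wired up this way, closely following the nfs_file_read() and generic_file_aio_read() hunks above (the function and filesystem names are hypothetical):

static ssize_t examplefs_file_read(struct kiocb *iocb, const struct iovec *iov,
				   unsigned long nr_segs, loff_t pos)
{
	struct address_space *mapping = iocb->ki_filp->f_mapping;

	if (iocb->ki_filp->f_flags & O_DIRECT) {
		/*
		 * generic_file_direct_IO() pins the user pages one iovec
		 * segment at a time and hands each batch to the actor in
		 * the page-based struct dio_args.
		 */
		return generic_file_direct_IO(READ, mapping, iocb, iov,
					      pos, nr_segs,
					      mapping->a_ops->direct_IO);
	}

	return generic_file_aio_read(iocb, iov, nr_segs, pos);
}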
* [PATCH 2/6] workqueue: add support for lazy workqueues 2009-08-20 10:17 ` [PATCH 2/4] direct-io: make O_DIRECT IO path be page based Jens Axboe @ 2009-08-20 10:17 ` Jens Axboe 2009-08-21 0:20 ` Andrew Morton 0 siblings, 1 reply; 30+ messages in thread From: Jens Axboe @ 2009-08-20 10:17 UTC (permalink / raw) To: linux-kernel; +Cc: jeff, benh, htejun, bzolnier, alan, Jens Axboe Lazy workqueues are like normal workqueues, except they don't start a thread per CPU by default. Instead threads are started when they are needed, and exit when they have been idle for some time. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> --- include/linux/workqueue.h | 5 ++ kernel/workqueue.c | 152 ++++++++++++++++++++++++++++++++++++++++++--- 2 files changed, 147 insertions(+), 10 deletions(-) diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index f14e20e..b2dd267 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -32,6 +32,7 @@ struct work_struct { #ifdef CONFIG_LOCKDEP struct lockdep_map lockdep_map; #endif + unsigned int cpu; }; #define WORK_DATA_INIT() ATOMIC_LONG_INIT(0) @@ -172,6 +173,7 @@ enum { WQ_F_SINGLETHREAD = 1, WQ_F_FREEZABLE = 2, WQ_F_RT = 4, + WQ_F_LAZY = 8, }; #ifdef CONFIG_LOCKDEP @@ -198,6 +200,7 @@ enum { __create_workqueue((name), WQ_F_SINGLETHREAD | WQ_F_FREEZABLE) #define create_singlethread_workqueue(name) \ __create_workqueue((name), WQ_F_SINGLETHREAD) +#define create_lazy_workqueue(name) __create_workqueue((name), WQ_F_LAZY) extern void destroy_workqueue(struct workqueue_struct *wq); @@ -211,6 +214,8 @@ extern int queue_delayed_work_on(int cpu, struct workqueue_struct *wq, extern void flush_workqueue(struct workqueue_struct *wq); extern void flush_scheduled_work(void); +extern void workqueue_set_lazy_timeout(struct workqueue_struct *wq, + unsigned long timeout); extern int schedule_work(struct work_struct *work); extern int schedule_work_on(int cpu, struct work_struct *work); diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 02ba7c9..d9ccebc 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -61,11 +61,17 @@ struct workqueue_struct { struct list_head list; const char *name; unsigned int flags; /* WQ_F_* flags */ + unsigned long lazy_timeout; + unsigned int core_cpu; #ifdef CONFIG_LOCKDEP struct lockdep_map lockdep_map; #endif }; +/* Default lazy workqueue timeout */ +#define WQ_DEF_LAZY_TIMEOUT (60 * HZ) + + /* Serializes the accesses to the list of workqueues. */ static DEFINE_SPINLOCK(workqueue_lock); static LIST_HEAD(workqueues); @@ -81,6 +87,8 @@ static const struct cpumask *cpu_singlethread_map __read_mostly; */ static cpumask_var_t cpu_populated_map __read_mostly; +static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu); + /* If it's single threaded, it isn't in the list of workqueues. */ static inline bool is_wq_single_threaded(struct workqueue_struct *wq) { @@ -141,11 +149,29 @@ static void insert_work(struct cpu_workqueue_struct *cwq, static void __queue_work(struct cpu_workqueue_struct *cwq, struct work_struct *work) { + struct workqueue_struct *wq = cwq->wq; unsigned long flags; - spin_lock_irqsave(&cwq->lock, flags); - insert_work(cwq, work, &cwq->worklist); - spin_unlock_irqrestore(&cwq->lock, flags); + /* + * This is a lazy workqueue and this particular CPU thread has + * exited. We can't create it from here, so add this work on our + * static thread. It will create this thread and move the work there. 
+ */ + if ((wq->flags & WQ_F_LAZY) && !cwq->thread) { + struct cpu_workqueue_struct *__cwq; + + local_irq_save(flags); + __cwq = wq_per_cpu(wq, wq->core_cpu); + work->cpu = smp_processor_id(); + spin_lock(&__cwq->lock); + insert_work(__cwq, work, &__cwq->worklist); + spin_unlock_irqrestore(&__cwq->lock, flags); + } else { + spin_lock_irqsave(&cwq->lock, flags); + work->cpu = smp_processor_id(); + insert_work(cwq, work, &cwq->worklist); + spin_unlock_irqrestore(&cwq->lock, flags); + } } /** @@ -259,13 +285,16 @@ int queue_delayed_work_on(int cpu, struct workqueue_struct *wq, } EXPORT_SYMBOL_GPL(queue_delayed_work_on); -static void run_workqueue(struct cpu_workqueue_struct *cwq) +static int run_workqueue(struct cpu_workqueue_struct *cwq) { + int did_work = 0; + spin_lock_irq(&cwq->lock); while (!list_empty(&cwq->worklist)) { struct work_struct *work = list_entry(cwq->worklist.next, struct work_struct, entry); work_func_t f = work->func; + int cpu; #ifdef CONFIG_LOCKDEP /* * It is permissible to free the struct work_struct @@ -280,7 +309,34 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq) trace_workqueue_execution(cwq->thread, work); cwq->current_work = work; list_del_init(cwq->worklist.next); + cpu = smp_processor_id(); spin_unlock_irq(&cwq->lock); + did_work = 1; + + /* + * If work->cpu isn't us, then we need to create the target + * workqueue thread (if someone didn't already do that) and + * move the work over there. + */ + if ((cwq->wq->flags & WQ_F_LAZY) && work->cpu != cpu) { + struct cpu_workqueue_struct *__cwq; + struct task_struct *p; + int err; + + __cwq = wq_per_cpu(cwq->wq, work->cpu); + p = __cwq->thread; + if (!p) + err = create_workqueue_thread(__cwq, work->cpu); + p = __cwq->thread; + if (p) { + if (work->cpu >= 0) + kthread_bind(p, work->cpu); + insert_work(__cwq, work, &__cwq->worklist); + wake_up_process(p); + goto out; + } + } + BUG_ON(get_wq_data(work) != cwq); work_clear_pending(work); @@ -305,24 +361,45 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq) cwq->current_work = NULL; } spin_unlock_irq(&cwq->lock); +out: + return did_work; } static int worker_thread(void *__cwq) { struct cpu_workqueue_struct *cwq = __cwq; + struct workqueue_struct *wq = cwq->wq; + unsigned long last_active = jiffies; DEFINE_WAIT(wait); + int may_exit; - if (cwq->wq->flags & WQ_F_FREEZABLE) + if (wq->flags & WQ_F_FREEZABLE) set_freezable(); set_user_nice(current, -5); + /* + * Allow exit if this isn't our core thread + */ + if ((wq->flags & WQ_F_LAZY) && smp_processor_id() != wq->core_cpu) + may_exit = 1; + else + may_exit = 0; + for (;;) { + int did_work; + prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE); if (!freezing(current) && !kthread_should_stop() && - list_empty(&cwq->worklist)) - schedule(); + list_empty(&cwq->worklist)) { + unsigned long timeout = wq->lazy_timeout; + + if (timeout && may_exit) + schedule_timeout(timeout); + else + schedule(); + } finish_wait(&cwq->more_work, &wait); try_to_freeze(); @@ -330,7 +407,19 @@ static int worker_thread(void *__cwq) if (kthread_should_stop()) break; - run_workqueue(cwq); + did_work = run_workqueue(cwq); + + /* + * If we did no work for the defined timeout period and we are + * allowed to exit, do so. 
+ */ + if (did_work) + last_active = jiffies; + else if (time_after(jiffies, last_active + wq->lazy_timeout) && + may_exit) { + cwq->thread = NULL; + break; + } } return 0; @@ -814,7 +903,10 @@ struct workqueue_struct *__create_workqueue_key(const char *name, cwq = init_cpu_workqueue(wq, singlethread_cpu); err = create_workqueue_thread(cwq, singlethread_cpu); start_workqueue_thread(cwq, -1); + wq->core_cpu = singlethread_cpu; } else { + int created = 0; + cpu_maps_update_begin(); /* * We must place this wq on list even if the code below fails. @@ -833,10 +925,16 @@ struct workqueue_struct *__create_workqueue_key(const char *name, */ for_each_possible_cpu(cpu) { cwq = init_cpu_workqueue(wq, cpu); - if (err || !cpu_online(cpu)) + if (err || !cpu_online(cpu) || + (created && (wq->flags & WQ_F_LAZY))) continue; err = create_workqueue_thread(cwq, cpu); start_workqueue_thread(cwq, cpu); + if (!err) { + if (!created) + wq->core_cpu = cpu; + created++; + } } cpu_maps_update_done(); } @@ -844,7 +942,9 @@ struct workqueue_struct *__create_workqueue_key(const char *name, if (err) { destroy_workqueue(wq); wq = NULL; - } + } else if (wq->flags & WQ_F_LAZY) + workqueue_set_lazy_timeout(wq, WQ_DEF_LAZY_TIMEOUT); + return wq; } EXPORT_SYMBOL_GPL(__create_workqueue_key); @@ -877,6 +977,13 @@ static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq) cwq->thread = NULL; } +static bool hotplug_should_start_thread(struct workqueue_struct *wq, int cpu) +{ + if ((wq->flags & WQ_F_LAZY) && cpu != wq->core_cpu) + return 0; + return 1; +} + /** * destroy_workqueue - safely terminate a workqueue * @wq: target workqueue @@ -923,6 +1030,8 @@ undo: switch (action) { case CPU_UP_PREPARE: + if (!hotplug_should_start_thread(wq, cpu)) + break; if (!create_workqueue_thread(cwq, cpu)) break; printk(KERN_ERR "workqueue [%s] for %i failed\n", @@ -932,6 +1041,8 @@ undo: goto undo; case CPU_ONLINE: + if (!hotplug_should_start_thread(wq, cpu)) + break; start_workqueue_thread(cwq, cpu); break; @@ -999,6 +1110,27 @@ long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg) EXPORT_SYMBOL_GPL(work_on_cpu); #endif /* CONFIG_SMP */ +/** + * workqueue_set_lazy_timeout - set lazy exit timeout + * @wq: the associated workqueue_struct + * @timeout: timeout in jiffies + * + * This will set the timeout for a lazy workqueue. If no work has been + * processed for @timeout jiffies, then the workqueue is allowed to exit. + * It will be dynamically created again when work is queued to it. + * + * Note that this only works for workqueues created with + * create_lazy_workqueue(). + */ +void workqueue_set_lazy_timeout(struct workqueue_struct *wq, + unsigned long timeout) +{ + if (WARN_ON(!(wq->flags & WQ_F_LAZY))) + return; + + wq->lazy_timeout = timeout; +} + void __init init_workqueues(void) { alloc_cpumask_var(&cpu_populated_map, GFP_KERNEL); -- 1.6.4.173.g3f189 ^ permalink raw reply related [flat|nested] 30+ messages in thread
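For reference, the whole life cycle of a lazy workqueue under this patch can be sketched as a minimal (hypothetical) module-style user; create_lazy_workqueue() and workqueue_set_lazy_timeout() are the interfaces added above, the rest is the existing workqueue API:

static struct workqueue_struct *example_wq;
static struct work_struct example_work;

static void example_work_fn(struct work_struct *work)
{
	/* runs on the queueing CPU once its thread has been spawned */
}

static int __init example_init(void)
{
	/* starts only the core thread, not one thread per online CPU */
	example_wq = create_lazy_workqueue("example");
	if (!example_wq)
		return -ENOMEM;

	/* optionally shorten the idle exit period from the 60*HZ default */
	workqueue_set_lazy_timeout(example_wq, 10 * HZ);

	INIT_WORK(&example_work, example_work_fn);
	queue_work(example_wq, &example_work);
	return 0;
}

static void __exit example_exit(void)
{
	destroy_workqueue(example_wq);
}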
* Re: [PATCH 2/6] workqueue: add support for lazy workqueues 2009-08-20 10:17 ` [PATCH 2/6] workqueue: add support for lazy workqueues Jens Axboe @ 2009-08-21 0:20 ` Andrew Morton 2009-08-24 8:06 ` Jens Axboe 0 siblings, 1 reply; 30+ messages in thread From: Andrew Morton @ 2009-08-21 0:20 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan, jens.axboe On Thu, 20 Aug 2009 12:17:39 +0200 Jens Axboe <jens.axboe@oracle.com> wrote: > Lazy workqueues are like normal workqueues, except they don't > start a thread per CPU by default. Instead threads are started > when they are needed, and exit when they have been idle for > some time. > > > ... > > @@ -280,7 +309,34 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq) > trace_workqueue_execution(cwq->thread, work); > cwq->current_work = work; > list_del_init(cwq->worklist.next); > + cpu = smp_processor_id(); > spin_unlock_irq(&cwq->lock); > + did_work = 1; > + > + /* > + * If work->cpu isn't us, then we need to create the target > + * workqueue thread (if someone didn't already do that) and > + * move the work over there. > + */ > + if ((cwq->wq->flags & WQ_F_LAZY) && work->cpu != cpu) { > + struct cpu_workqueue_struct *__cwq; > + struct task_struct *p; > + int err; > + > + __cwq = wq_per_cpu(cwq->wq, work->cpu); > + p = __cwq->thread; > + if (!p) > + err = create_workqueue_thread(__cwq, work->cpu); > + p = __cwq->thread; > + if (p) { > + if (work->cpu >= 0) It's an unsigned int. This test is always true. > + kthread_bind(p, work->cpu); I wonder what happens if work->cpu isn't online any more. > + insert_work(__cwq, work, &__cwq->worklist); > + wake_up_process(p); > + goto out; > + } > + } > + > > BUG_ON(get_wq_data(work) != cwq); > work_clear_pending(work); > > ... > ^ permalink raw reply [flat|nested] 30+ messages in thread
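The signedness problem is easy to see in isolation: with the "unsigned int cpu" member added to struct work_struct in this patch, a -1 sentinel wraps to UINT_MAX and the guard can never fire. A minimal sketch of the point (the -1 sentinel policy is assumed here for illustration; the patch itself only ever assigns smp_processor_id() to work->cpu):

unsigned int cpu = -1;	/* wraps to UINT_MAX */
if (cpu >= 0)		/* always true for an unsigned type */
	/* kthread_bind() would run even in the intended "no CPU" case */;

/* one possible fix: make the member signed so a sentinel is expressible */
struct work_struct {
	/* ... existing members ... */
	int cpu;	/* -1 would mean "no CPU recorded" */
};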
* Re: [PATCH 2/6] workqueue: add support for lazy workqueues 2009-08-21 0:20 ` Andrew Morton @ 2009-08-24 8:06 ` Jens Axboe 0 siblings, 0 replies; 30+ messages in thread From: Jens Axboe @ 2009-08-24 8:06 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, jeff, benh, htejun, bzolnier, alan On Thu, Aug 20 2009, Andrew Morton wrote: > On Thu, 20 Aug 2009 12:17:39 +0200 > Jens Axboe <jens.axboe@oracle.com> wrote: > > > Lazy workqueues are like normal workqueues, except they don't > > start a thread per CPU by default. Instead threads are started > > when they are needed, and exit when they have been idle for > > some time. > > > > > > ... > > > > @@ -280,7 +309,34 @@ static void run_workqueue(struct cpu_workqueue_struct *cwq) > > trace_workqueue_execution(cwq->thread, work); > > cwq->current_work = work; > > list_del_init(cwq->worklist.next); > > + cpu = smp_processor_id(); > > spin_unlock_irq(&cwq->lock); > > + did_work = 1; > > + > > + /* > > + * If work->cpu isn't us, then we need to create the target > > + * workqueue thread (if someone didn't already do that) and > > + * move the work over there. > > + */ > > + if ((cwq->wq->flags & WQ_F_LAZY) && work->cpu != cpu) { > > + struct cpu_workqueue_struct *__cwq; > > + struct task_struct *p; > > + int err; > > + > > + __cwq = wq_per_cpu(cwq->wq, work->cpu); > > + p = __cwq->thread; > > + if (!p) > > + err = create_workqueue_thread(__cwq, work->cpu); > > + p = __cwq->thread; > > + if (p) { > > + if (work->cpu >= 0) > > It's an unsigned int. This test is always true. > > > + kthread_bind(p, work->cpu); > > I wonder what happens if work->cpu isn't online any more. That's a good question. The workqueue "documentation" states that it is the caller's responsibility to ensure that the CPU stays online, but I think that requirement is pretty much ignored. Probably since it'd be costly to do. So that bit needs looking into. -- Jens Axboe ^ permalink raw reply [flat|nested] 30+ messages in thread
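Something like the following is probably the shape of the missing check -- an untested sketch only: cpu_online() and get_online_cpus()/put_online_cpus() are existing kernel interfaces, but the policy of falling back to the current CPU when the target has gone away is invented here, and the locking around insert_work() is elided for brevity:

get_online_cpus();
if (cpu_online(work->cpu) && __cwq->thread) {
	/* target CPU still there: bind and hand the work over */
	kthread_bind(__cwq->thread, work->cpu);
	insert_work(__cwq, work, &__cwq->worklist);
	wake_up_process(__cwq->thread);
} else {
	/* target CPU went away: let the current CPU run the work */
	insert_work(cwq, work, &cwq->worklist);
}
put_online_cpus();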
Thread overview: 30+ messages 2009-08-20 10:19 [PATCH 0/6] Lazy workqueues Jens Axboe 2009-08-20 10:19 ` [PATCH 1/6] workqueue: replace singlethread/freezable/rt parameters and variables with flags Jens Axboe 2009-08-20 10:20 ` [PATCH 2/6] workqueue: add support for lazy workqueues Jens Axboe 2009-08-20 12:01 ` Frederic Weisbecker 2009-08-20 12:10 ` Jens Axboe 2009-08-20 10:20 ` [PATCH 3/6] crypto: use " Jens Axboe 2009-08-20 10:20 ` [PATCH 4/6] libata: use lazy workqueues for the pio task Jens Axboe 2009-08-20 12:40 ` Stefan Richter 2009-08-20 12:48 ` Jens Axboe 2009-08-20 10:20 ` [PATCH 5/6] aio: use lazy workqueues Jens Axboe 2009-08-20 15:09 ` Jeff Moyer 2009-08-21 18:31 ` Zach Brown 2009-08-20 10:20 ` [PATCH 6/6] sunrpc: " Jens Axboe 2009-08-20 12:04 ` [PATCH 0/6] Lazy workqueues Peter Zijlstra 2009-08-20 12:08 ` Jens Axboe 2009-08-20 12:16 ` Peter Zijlstra 2009-08-23 2:42 ` Junio C Hamano 2009-08-24 7:04 ` git send-email defaults Peter Zijlstra 2009-08-24 8:04 ` [PATCH 0/6] Lazy workqueues Jens Axboe 2009-08-24 9:03 ` Junio C Hamano 2009-08-24 9:11 ` Peter Zijlstra 2009-08-20 12:22 ` Frederic Weisbecker 2009-08-20 12:41 ` Jens Axboe 2009-08-20 13:04 ` Tejun Heo 2009-08-20 12:59 ` Steven Whitehouse 2009-08-20 12:55 ` Tejun Heo 2009-08-21 6:58 ` Jens Axboe -- strict thread matches above, loose matches on Subject: below -- 2009-08-20 10:17 Jens Axboe 2009-08-20 10:17 ` [PATCH 1/4] direct-io: unify argument passing by adding a dio_args structure Jens Axboe 2009-08-20 10:17 ` [PATCH 1/6] workqueue: replace singlethread/freezable/rt parameters and variables with flags Jens Axboe 2009-08-20 10:17 ` [PATCH 2/4] direct-io: make O_DIRECT IO path be page based Jens Axboe 2009-08-20 10:17 ` [PATCH 2/6] workqueue: add support for lazy workqueues Jens Axboe 2009-08-21 0:20 ` Andrew Morton 2009-08-24 8:06 ` Jens Axboe