public inbox for linux-kernel@vger.kernel.org
* [PATCH] file: kill unnecessary timer in fdtable_defer
@ 2006-08-20 13:15 Tejun Heo
  2006-08-21  4:32 ` Dipankar Sarma
  0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2006-08-20 13:15 UTC (permalink / raw)
  To: viro, dipankar, linux-kernel

free_fdtable_rcu() schedules a timer to reschedule fddef->wq if
schedule_work() on it returns 0.  However, schedule_work() guarantees
that the target work is executed at least once after the scheduling,
regardless of its return value.  0 return simply means that the work
was already pending and thus no further action was required.

Another problem is that it used the constant '5' as the @expires
argument to mod_timer().

Kill unnecessary fddef->timer.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
---
 fs/file.c |   29 ++---------------------------
 1 file changed, 2 insertions(+), 27 deletions(-)

Index: work/fs/file.c
===================================================================
--- work.orig/fs/file.c
+++ work/fs/file.c
@@ -21,7 +21,6 @@
 struct fdtable_defer {
 	spinlock_t lock;
 	struct work_struct wq;
-	struct timer_list timer;
 	struct fdtable *next;
 };
 
@@ -75,22 +74,6 @@ static void __free_fdtable(struct fdtabl
 	kfree(fdt);
 }
 
-static void fdtable_timer(unsigned long data)
-{
-	struct fdtable_defer *fddef = (struct fdtable_defer *)data;
-
-	spin_lock(&fddef->lock);
-	/*
-	 * If someone already emptied the queue return.
-	 */
-	if (!fddef->next)
-		goto out;
-	if (!schedule_work(&fddef->wq))
-		mod_timer(&fddef->timer, 5);
-out:
-	spin_unlock(&fddef->lock);
-}
-
 static void free_fdtable_work(struct fdtable_defer *f)
 {
 	struct fdtable *fdt;
@@ -142,13 +125,8 @@ static void free_fdtable_rcu(struct rcu_
 		spin_lock(&fddef->lock);
 		fdt->next = fddef->next;
 		fddef->next = fdt;
-		/*
-		 * vmallocs are handled from the workqueue context.
-		 * If the per-cpu workqueue is running, then we
-		 * defer work scheduling through a timer.
-		 */
-		if (!schedule_work(&fddef->wq))
-			mod_timer(&fddef->timer, 5);
+		/* vmallocs are handled from the workqueue context */
+		schedule_work(&fddef->wq);
 		spin_unlock(&fddef->lock);
 		put_cpu_var(fdtable_defer_list);
 	}
@@ -362,9 +340,6 @@ static void __devinit fdtable_defer_list
 	struct fdtable_defer *fddef = &per_cpu(fdtable_defer_list, cpu);
 	spin_lock_init(&fddef->lock);
 	INIT_WORK(&fddef->wq, (void (*)(void *))free_fdtable_work, fddef);
-	init_timer(&fddef->timer);
-	fddef->timer.data = (unsigned long)fddef;
-	fddef->timer.function = fdtable_timer;
 	fddef->next = NULL;
 }
 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] file: kill unnecessary timer in fdtable_defer
  2006-08-20 13:15 [PATCH] file: kill unnecessary timer in fdtable_defer Tejun Heo
@ 2006-08-21  4:32 ` Dipankar Sarma
  2006-08-21  5:18   ` Tejun Heo
  0 siblings, 1 reply; 7+ messages in thread
From: Dipankar Sarma @ 2006-08-21  4:32 UTC (permalink / raw)
  To: Tejun Heo; +Cc: viro, linux-kernel

On Sun, Aug 20, 2006 at 10:15:42PM +0900, Tejun Heo wrote:
> free_fdtable_rcu() schedules a timer to reschedule fddef->wq if
> schedule_work() on it returns 0.  However, schedule_work() guarantees
> that the target work is executed at least once after the scheduling,
> regardless of its return value.  0 return simply means that the work
> was already pending and thus no further action was required.

Hmm.. Is this really true ? IIRC, schedule_work() checks pending
work based on bit ops on work->pending and clear_bit() is not
a memory barrier. So, if I see work->pending = 1 in free_fdtable_work(), how
do I know that the work function isn't already executing, having
missed the new work that I had queued ?


Thanks
Dipankar


* Re: [PATCH] file: kill unnecessary timer in fdtable_defer
  2006-08-21  4:32 ` Dipankar Sarma
@ 2006-08-21  5:18   ` Tejun Heo
  2006-08-21  7:58     ` Dipankar Sarma
  0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2006-08-21  5:18 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: viro, linux-kernel

On Mon, Aug 21, 2006 at 10:02:57AM +0530, Dipankar Sarma wrote:
> On Sun, Aug 20, 2006 at 10:15:42PM +0900, Tejun Heo wrote:
> > free_fdtable_rcu() schedules a timer to reschedule fddef->wq if
> > schedule_work() on it returns 0.  However, schedule_work() guarantees
> > that the target work is executed at least once after the scheduling,
> > regardless of its return value.  0 return simply means that the work
> > was already pending and thus no further action was required.
> 
> Hmm.. Is this really true ? IIRC, schedule_work() checks pending
> work based on bit ops on work->pending and clear_bit() is not
> a memory barrier.

Those bitops are not memory barriers but they can define order between
them alright.  Once the execution order is correct, the rest of
synchronization is the caller's responsibility.  In fddef's case, it's
achieved via fddef->lock.

> So, if I see work->pending = 1 in free_fdtable_work(), how do I know
> that the work function isn't already executing, having missed the new
> work that I had queued ?

This is classic edge-triggered event handling, where each event
doesn't require separate handling.  The producer produces one or more
items at a time and the consumer consumes the whole queue on each
invocation.

For example, if Pn indicates producing item n and C indicates
consuming, the sequences "P0 C P1 C P2 C" and "P0 P1 P2 C" are
effectively identical, so all that's required for correct operation
(that is, full queue consumption) is that the consumer runs at least
once after the last item is produced.
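
The equivalence can be sketched in plain C.  This is a userspace model
of the pattern, not kernel code (the linked stack stands in for
fddef->next; all names are illustrative): because the consumer drains
the whole queue, "P0 C P1 C P2 C" and "P0 P1 P2 C" leave the same
final state.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the deferred list: a singly linked stack. */
struct item { struct item *next; };

static struct item *queue;      /* fddef->next analogue */
static int consumed;            /* items handled so far */

/* Producer: push one item.  Pn in the text above. */
static void produce(struct item *it)
{
	it->next = queue;
	queue = it;
}

/* Consumer: drain the whole queue on each invocation.  C in the text. */
static void consume(void)
{
	while (queue) {
		queue = queue->next;
		consumed++;
	}
}
```

Whether consume() runs after every produce() or once at the end, every
item is handled and the queue ends up empty.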

In the workqueue, this is guaranteed by

1. If the pending bit is set, the work is guaranteed to be executed at
   some point in the future - it's on the timer or the queue.

2. The queuer sets the pending bit *after* producing a job to be
   done.

3. The worker clears the pending bit *before* executing the job.

I sometimes find it easier to understand with a diagram which looks
like the following.  Time flows from top to bottom.

          Queuer			  Worker

        -------------
       | produce job |
       |             |             <--- clr pending --->
        -------------                        |
              |                              v
              v                       --------------
    <--- set pending --->            | consume jobs |
				     |		    |
				      --------------

Now move the queuer and worker up and down in your mind.  If 'set
pending' happens before 'clr pending', 'consume jobs' is guaranteed to
see the job the queuer has produced, whether pending was set or clear
beforehand.  If 'set pending' happens after 'clr pending', the queuer
is guaranteed to set the pending bit and schedule the worker.  The
workqueue is a straightforward expansion where there are N queuers and
a repeating consumer.
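
The three guarantees can also be modeled in code.  Below is a
userspace sketch (illustrative names, single-threaded; not the actual
workqueue implementation), where the queuer's test of the pending flag
mirrors schedule_work()'s return value:

```c
#include <assert.h>

static int pending;	/* work->pending model */
static int jobs;	/* produced-but-unconsumed jobs */
static int runs;	/* times the worker body ran */

/* Queuer: produce *before* setting pending (guarantee 2).  Returns 1
 * if newly scheduled, 0 if already pending - like schedule_work(). */
static int queue_job(void)
{
	jobs++;
	if (pending)
		return 0;	/* already scheduled; it will see this job */
	pending = 1;
	return 1;
}

/* Worker: clear pending *before* consuming (guarantee 3), then drain. */
static void run_worker(void)
{
	if (!pending)
		return;
	pending = 0;
	jobs = 0;		/* consume every queued job */
	runs++;
}
```

A 0 return from queue_job() never loses work: the single pending run
still consumes every job queued before it, which is exactly why the
timer-based requeueing was unnecessary.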

-- 
tejun


* Re: [PATCH] file: kill unnecessary timer in fdtable_defer
  2006-08-21  5:18   ` Tejun Heo
@ 2006-08-21  7:58     ` Dipankar Sarma
  2006-08-21  8:24       ` Tejun Heo
  2006-09-22 15:46       ` Tejun Heo
  0 siblings, 2 replies; 7+ messages in thread
From: Dipankar Sarma @ 2006-08-21  7:58 UTC (permalink / raw)
  To: Tejun Heo; +Cc: viro, linux-kernel

On Mon, Aug 21, 2006 at 02:18:16PM +0900, Tejun Heo wrote:
> On Mon, Aug 21, 2006 at 10:02:57AM +0530, Dipankar Sarma wrote:
> > > regardless of its return value.  0 return simply means that the work
> > > was already pending and thus no further action was required.
> > 
> > Hmm.. Is this really true ? IIRC, schedule_work() checks pending
> > work based on bit ops on work->pending and clear_bit() is not
> > a memory barrier.
> 
> Those bitops are not memory barriers but they can define order between
> them alright.  Once the execution order is correct, the rest of

Huh ? If they are not memory barriers, then how can you guarantee
ordering on weakly ordered CPUs ?


> In the workqueue, this is guaranteed by
> 
> 1. If the pending bit is set, the work is guaranteed to be executed at
>    some point in the future - it's on the timer or the queue.
> 
> 2. The queuer sets the pending bit *after* producing a job to be
>    done.
> 
> 3. The worker clears the pending bit *before* executing the job.
> 
> I sometimes find it easier to understand with a diagram which looks
> like the following.  Time flows from top to bottom.
> 
>           Queuer			  Worker
> 
>         -------------
>        | produce job |
>        |             |             <--- clr pending --->
>         -------------                        |
>               |                              v
>               v                       --------------
>     <--- set pending --->            | consume jobs |
> 				     |		    |
> 				      --------------
> 
> Now move the queuer and worker up and down in your mind.  If 'set
> pending' happens before 'clr pending', 'consume jobs' is guaranteed to
> see the job the queuer has produced, whether pending was set or clear
> beforehand.  If 'set pending' happens after 'clr pending', the queuer
> is guaranteed to set the pending bit and schedule the worker.  The
> workqueue is a straightforward expansion where there are N queuers and
> a repeating consumer.

Given that there is no smp_mb__after_clear_bit() after clearing
work->pending, what prevents the worker thread from seeing
the state of the deferred fd queue from before the pending bit was set ?
IOW, the queuer sees pending = 1 and skips waking the
worker thread, while the worker sees a stale state of the deferred fd
queue, missing the newly queued work. That should be possible on
a cpu with weak memory ordering. Perhaps, we should fix
__queue_work() to add the smp_mb__after_clear_bit() and
make sure that we have a memory barrier after adding the
deferred fds.
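
The barrier pairing suggested here can be sketched with C11 atomics in
userspace (a hypothetical model, not the kernel's __queue_work();
atomic_thread_fence() stands in for smp_mb() /
smp_mb__after_clear_bit()):

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int pending;	/* work->pending model */
static int queued;		/* deferred fd count (shared state) */

/* Queuer: produce, then a full fence, then test-and-set pending.
 * Returns 0 if the work was already pending. */
static int queue_work_model(void)
{
	queued++;
	atomic_thread_fence(memory_order_seq_cst);	/* cf. smp_mb() */
	return !atomic_exchange(&pending, 1);
}

/* Worker: clear pending, then a full fence before reading the shared
 * state, so the read cannot be satisfied by a stale pre-clear value. */
static int work_model(void)
{
	int n;

	atomic_store(&pending, 0);
	/* cf. the proposed smp_mb__after_clear_bit() */
	atomic_thread_fence(memory_order_seq_cst);
	n = queued;
	queued = 0;
	return n;
}
```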

Thanks
Dipankar




* Re: [PATCH] file: kill unnecessary timer in fdtable_defer
  2006-08-21  7:58     ` Dipankar Sarma
@ 2006-08-21  8:24       ` Tejun Heo
  2006-09-22 15:46       ` Tejun Heo
  1 sibling, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2006-08-21  8:24 UTC (permalink / raw)
  To: Dipankar Sarma; +Cc: viro, linux-kernel

On Mon, Aug 21, 2006 at 01:28:18PM +0530, Dipankar Sarma wrote:
> On Mon, Aug 21, 2006 at 02:18:16PM +0900, Tejun Heo wrote:
> > On Mon, Aug 21, 2006 at 10:02:57AM +0530, Dipankar Sarma wrote:
> > > > regardless of its return value.  0 return simply means that the work
> > > > was already pending and thus no further action was required.
> > > 
> > > Hmm.. Is this really true ? IIRC, schedule_work() checks pending
> > > work based on bit ops on work->pending and clear_bit() is not
> > > a memory barrier.
> > 
> > Those bitops are not memory barriers but they can define order between
> > them alright.  Once the execution order is correct, the rest of
> 
> Huh ? If they are not memory barriers, then how can you guarantee
> ordering on weakly ordered CPUs ?

Atomic bitops define ordering *between* them, not *around* them.  ie. if
you have two atomic bitops on the same bit, they're ordered one way or
the other.  As the workqueue code currently stands, ordering around
the bitops is the caller's responsibility and fddef does it with a
spinlock.  As the producer and consumer usually access a common
resource, that kind of synchronization is inherent in most cases.

[--snip--]
> Given that there is no smp_mb__after_clear_bit() after clearing
> work->pending, what prevents the worker thread from seeing

Let me revise the diagram.

            Queuer                         Worker
 
             lock
         -------------             <--- clr pending --->
        | produce job |                       |
        |             |                       v
         -------------                  trying to lock
            unlock                            v
               |                        lock granted
               v                       --------------
     <--- set pending --->            | consume jobs |
                                      |              |
                                       --------------
                                           unlock
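
In the revised diagram the ordering comes from the lock, not from
barriers around the bitop.  A userspace sketch with a pthread mutex
standing in for fddef->lock (illustrative names, single-threaded
driver; not the kernel code):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* fddef->lock */
static int queued;	/* deferred items, protected by lock */
static int pending;	/* work pending bit */

/* Queuer: produce under the lock, then set pending. */
static void queuer(void)
{
	pthread_mutex_lock(&lock);
	queued++;			/* produce job */
	pthread_mutex_unlock(&lock);
	pending = 1;			/* set pending after producing */
}

/* Worker: clear pending, then take the lock and consume everything.
 * The lock acquisition, not the bitop, orders the queue accesses:
 * once the worker holds the lock, it sees every item queued before
 * the matching unlock. */
static int worker(void)
{
	int n;

	pending = 0;
	pthread_mutex_lock(&lock);
	n = queued;
	queued = 0;
	pthread_mutex_unlock(&lock);
	return n;
}
```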

> the state of the deferred fd queue from before the pending bit was set ?
> IOW, the queuer sees pending = 1 and skips waking the
> worker thread, while the worker sees a stale state of the deferred fd
> queue, missing the newly queued work. That should be possible on
> a cpu with weak memory ordering.

Queue being a shared resource, people usually synchronize around it.

> Perhaps, we should fix __queue_work() to add the
> smp_mb__after_clear_bit() and make sure that we have a memory
> barrier after adding the deferred fds.

That would help if the producer and consumer depend on memory ordering
for synchronization, which usually isn't the case.  I'm not sure
whether adding that is a good idea or not.  IMHO, it's better to use
explicit memory barriers with enough comments in such subtle cases to
make things clear.

Either way, fddef is just a straightforward producer-consumer case
with no need for a requeueing mechanism.  There is no such need, and
no other user does anything similar.

Thanks.

-- 
tejun


* Re: [PATCH] file: kill unnecessary timer in fdtable_defer
  2006-08-21  7:58     ` Dipankar Sarma
  2006-08-21  8:24       ` Tejun Heo
@ 2006-09-22 15:46       ` Tejun Heo
  1 sibling, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2006-09-22 15:46 UTC (permalink / raw)
  To: dipankar; +Cc: viro, linux-kernel

Hello, Dipankar.

It has been a month and this thread hasn't gotten anywhere.  If there is 
no further objection, I'll try to push it through -mm for testing.

Thanks.

-- 
tejun


* [PATCH] file: kill unnecessary timer in fdtable_defer
@ 2006-10-17 11:16 Tejun Heo
  0 siblings, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2006-10-17 11:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: viro, dipankar, linux-kernel

free_fdtable_rcu() schedules a timer to reschedule fddef->wq if
schedule_work() on it returns 0.  However, schedule_work() guarantees
that the target work is executed at least once after the scheduling,
regardless of its return value.  0 return simply means that the work
was already pending and thus no further action was required.

Another problem is that it used the constant '5' as the @expires
argument to mod_timer().

Kill unnecessary fddef->timer.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
---

This patch was first posted on August 20th.  The original thread can
be seen at the following URL.

http://thread.gmane.org/gmane.linux.kernel/438712/focus=438712

Dipankar objected but the discussion was dropped in the middle.
Please read the thread for more info.  I still think this patch is
correct.

Andrew, please push this patch through -mm if this looks correct.

Thanks.

 file.c |   29 ++---------------------------
  1 file changed, 2 insertions(+), 27 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 8e81775..d91db2c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -21,7 +21,6 @@ #include <linux/workqueue.h>
 struct fdtable_defer {
 	spinlock_t lock;
 	struct work_struct wq;
-	struct timer_list timer;
 	struct fdtable *next;
 };
 
@@ -75,22 +74,6 @@ static void __free_fdtable(struct fdtabl
 	kfree(fdt);
 }
 
-static void fdtable_timer(unsigned long data)
-{
-	struct fdtable_defer *fddef = (struct fdtable_defer *)data;
-
-	spin_lock(&fddef->lock);
-	/*
-	 * If someone already emptied the queue return.
-	 */
-	if (!fddef->next)
-		goto out;
-	if (!schedule_work(&fddef->wq))
-		mod_timer(&fddef->timer, 5);
-out:
-	spin_unlock(&fddef->lock);
-}
-
 static void free_fdtable_work(struct fdtable_defer *f)
 {
 	struct fdtable *fdt;
@@ -142,13 +125,8 @@ static void free_fdtable_rcu(struct rcu_
 		spin_lock(&fddef->lock);
 		fdt->next = fddef->next;
 		fddef->next = fdt;
-		/*
-		 * vmallocs are handled from the workqueue context.
-		 * If the per-cpu workqueue is running, then we
-		 * defer work scheduling through a timer.
-		 */
-		if (!schedule_work(&fddef->wq))
-			mod_timer(&fddef->timer, 5);
+		/* vmallocs are handled from the workqueue context */
+		schedule_work(&fddef->wq);
 		spin_unlock(&fddef->lock);
 		put_cpu_var(fdtable_defer_list);
 	}
@@ -352,9 +330,6 @@ static void __devinit fdtable_defer_list
 	struct fdtable_defer *fddef = &per_cpu(fdtable_defer_list, cpu);
 	spin_lock_init(&fddef->lock);
 	INIT_WORK(&fddef->wq, (void (*)(void *))free_fdtable_work, fddef);
-	init_timer(&fddef->timer);
-	fddef->timer.data = (unsigned long)fddef;
-	fddef->timer.function = fdtable_timer;
 	fddef->next = NULL;
 }
 
