public inbox for linux-kernel@vger.kernel.org
* Re: [PATCH] task_work: remove fifo ordering guarantee
@ 2015-08-29 21:08 George Spelvin
  2015-08-31 13:22 ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: George Spelvin @ 2015-08-29 21:08 UTC (permalink / raw)
  To: oleg; +Cc: eric.dumazet, linux, linux-kernel, mingo

Oleg Nesterov wrote:
> On 08/29, Ingo Molnar wrote:
>> So I'm wondering, is there any strong reason why we couldn't use a double
>> linked list and still do FIFO and remove that silly linear list walking hack?
> 
> This will obviously enlarge callback_head, and it is often embedded.
> But this is minor.
> 
> If we use a double linked list we can't do task_work_add() lockless.
> So we will need another spinlock_t in task_struct. We can't use pi_lock.

You only need a singly linked list for FIFO, but indeed lockless
access is a pain.

For a LIFO stack, you just do a single compare-and-swap on the head.
Once an entry is in the list, it's immutable.
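In C11-atomics pseudocode that push is a few lines (a minimal sketch: `struct node`, its `val` field, and the use of <stdatomic.h> in place of the kernel's cmpxchg() are all illustrative assumptions):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
	int val;			/* hypothetical payload */
};

/* LIFO push: one CAS on the head pointer; the node is immutable after. */
static void lifo_push(struct node *_Atomic *head, struct node *n)
{
	struct node *old = atomic_load(head);

	do {
		n->next = old;	/* re-done each time the CAS has to retry */
	} while (!atomic_compare_exchange_weak(head, &old, n));
}
```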

For FIFO, you only need one pointer in the nodes, but two in the list
head: a head pointer and a tail pointer.

The problem for lockless access is that you have to update both the next
pointer and the tail pointer, and without very specialized instructions
like 680x0's CAS2, there's no way to do them both atomically.

So the procedure to append (write) to the list is:
- Set your newly added node's next pointer to NULL.
- CAS the tail pointer.  This establishes your place in line,
  but does not make it visible to the reader.
- At this point, the next pointer is no longer yours and may be
  written by the next writer, but you may still finish filling in
  the rest of the node if you like.
- But also, there's a sort of priority inversion problem: if a writer
  stalls here, neither its node nor any following writer's is visible
  to the reader.
- Then you update the previous tail (obtained from the CAS) to link via its
  next pointer to your newly added node.  This makes it visible to the reader.
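Sketched with C11 atomics standing in for the kernel's xchg() (struct and function names invented here; a dedicated stub node makes the empty-queue case need no branch, anticipating the sentinel idea discussed below; note an unconditional exchange suffices for the tail swap, since the swap always wins):

```c
#include <stdatomic.h>
#include <stddef.h>

struct qnode {
	struct qnode *_Atomic next;
};

struct fifo {
	struct qnode stub;		/* sentinel; never leaves the queue */
	struct qnode *_Atomic tail;	/* accessed by writers only */
};

static void fifo_init(struct fifo *q)
{
	atomic_store(&q->stub.next, NULL);
	atomic_store(&q->tail, &q->stub);
}

/* Multi-writer append: safe against concurrent writers, no lock. */
static void fifo_append(struct fifo *q, struct qnode *n)
{
	struct qnode *prev;

	atomic_store(&n->next, NULL);		/* step 1 */
	prev = atomic_exchange(&q->tail, n);	/* step 2: claim our place */
	/* a stall here hides this node and every later one from the reader */
	atomic_store(&prev->next, n);		/* step 3: publish */
}
```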

- The reader, meanwhile, totally ignores the tail pointer, and
  only uses the head and next pointers.

Now, one trick with the above is that the tail pointer must *only*
be accessed by writers.  The single-threaded technique of using a
"struct **" tail pointer and setting tail = &head to represent an
empty list is not okay.

Rather, the reader must never deallocate a node with a NULL next pointer.
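A matching single-reader dequeue might look like this (same caveats: invented names, C11 atomics in place of kernel primitives; a full version would also re-append the stub once it is consumed, the "moved to the tail" step described just below, which is omitted here for brevity):

```c
#include <stdatomic.h>
#include <stddef.h>

struct qnode {
	struct qnode *_Atomic next;
};

struct fifo {
	struct qnode stub;		/* sentinel; reader starts behind it */
	struct qnode *_Atomic tail;	/* writers' territory: reader keeps out */
	struct qnode *head;		/* private to the single reader */
};

static void fifo_init(struct fifo *q)
{
	atomic_store(&q->stub.next, NULL);
	atomic_store(&q->tail, &q->stub);
	q->head = &q->stub;
}

/* Writer side: swap the tail, then link through the predecessor. */
static void fifo_append(struct fifo *q, struct qnode *n)
{
	atomic_store(&n->next, NULL);
	struct qnode *prev = atomic_exchange(&q->tail, n);
	atomic_store(&prev->next, n);
}

/*
 * Single-reader dequeue, using only head and next pointers.
 * Returns NULL if the queue is empty, or if the newest node's writer
 * is still between its tail swap and its next-pointer store -- a node
 * whose next is NULL must never be detached.
 */
static struct qnode *fifo_pop(struct fifo *q)
{
	struct qnode *n = q->head;

	if (n == &q->stub) {			/* step past the sentinel */
		n = atomic_load(&q->stub.next);
		if (!n)
			return NULL;		/* empty */
		q->head = n;
	}
	struct qnode *next = atomic_load(&n->next);

	if (!next)
		return NULL;	/* last node: leave it in place for now */
	q->head = next;
	return n;
}
```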

If the nodes are large, you can minimize the overhead of this by having a
dedicated sentinel/dummy node which lives in the queue and gets moved to
the tail whenever it reaches the head *and* has a non-NULL next pointer.
But the reader may not depend on this node always being visible!

The reader might find the sentinel pointing to node A, and then move it to
the tail by doing a CAS on the tail pointer, but the tail pointer
by then points to B.  It's possible that A's next pointer is still NULL
and the writer adding B has not gotten to the second step yet.


Anyway, it can be done, but honestly the current "empty the stack in
batches and then reverse it" is simpler.

The cond_resched is ugly, but the overhead is minimal, and the FIFO
guarantee is convenient.
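That scheme is simple enough to sketch in a few lines (C11 atomics again standing in for the kernel's xchg(); names invented): detach the whole stack with one atomic swap, then reverse it so the works run in FIFO order.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
};

/* Grab the whole LIFO stack at once, then reverse it into FIFO order. */
static struct node *pop_all_fifo(struct node *_Atomic *head)
{
	struct node *work = atomic_exchange(head, NULL);
	struct node *fifo = NULL;

	while (work) {
		struct node *next = work->next;

		work->next = fifo;	/* relink in reversed order */
		fifo = work;
		work = next;
	}
	return fifo;			/* oldest-queued node first */
}
```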


By the way, it's quite possible (LIFO or FIFO) for the writer to batch up
queue adds and only do one CAS to add an arbitrary number.
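For the LIFO case, that batching might look like this (same C11-atomics caveats; the chain from first to last is linked privately before the single CAS publishes it, so concurrent readers never see a half-built chain):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	struct node *next;
};

/*
 * Publish a privately linked chain first..last with a single CAS.
 * Other writers may race; only the one CAS touches shared state.
 */
static void lifo_push_batch(struct node *_Atomic *head,
			    struct node *first, struct node *last)
{
	struct node *old = atomic_load(head);

	do {
		last->next = old;	/* splice batch onto current stack */
	} while (!atomic_compare_exchange_weak(head, &old, first));
}
```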

* [PATCH] task_work: remove fifo ordering guarantee
@ 2015-08-29  2:42 Eric Dumazet
  2015-08-29  3:19 ` Linus Torvalds
  2015-08-29 12:49 ` Oleg Nesterov
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2015-08-29  2:42 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds
  Cc: linux-kernel@vger.kernel.org, Oleg Nesterov, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

From: Eric Dumazet <edumazet@google.com>

In commit f341861fb0b ("task_work: add a scheduling point in
task_work_run()") I fixed a latency problem by adding a
cond_resched() call.

Later, commit ac3d0da8f329 added yet another loop to reverse the
list, bringing back the latency spike:

I've seen in some cases this loop taking 275 ms, if for example a
process with 2,000,000 files is killed.

We could add yet another cond_resched() in the reverse loop, or we
can simply remove the reversal, as I do not think anything depends
on the order of the works queued by task_work_add().

Fixes: ac3d0da8f329 ("task_work: Make task_work_add() lockless")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
---
 kernel/task_work.c |   12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index 8727032e3a6f..53fa971d000d 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -18,6 +18,8 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
  * This is like the signal handler which runs in kernel mode, but it doesn't
  * try to wake up the @task.
  *
+ * Note: there is no ordering guarantee on works queued here.
+ *
  * RETURNS:
  * 0 if succeeds or -ESRCH.
  */
@@ -108,16 +110,6 @@ void task_work_run(void)
 		raw_spin_unlock_wait(&task->pi_lock);
 		smp_mb();
 
-		/* Reverse the list to run the works in fifo order */
-		head = NULL;
-		do {
-			next = work->next;
-			work->next = head;
-			head = work;
-			work = next;
-		} while (work);
-
-		work = head;
 		do {
 			next = work->next;
 			work->func(work);



