* Re: [PATCH] task_work: remove fifo ordering guarantee
@ 2015-08-29 21:08 George Spelvin
2015-08-31 13:22 ` Oleg Nesterov
0 siblings, 1 reply; 20+ messages in thread
From: George Spelvin @ 2015-08-29 21:08 UTC (permalink / raw)
To: oleg; +Cc: eric.dumazet, linux, linux-kernel, mingo
Oleg Nesterov wrote:
> On 08/29, Ingo Molnar wrote:
>> So I'm wondering, is there any strong reason why we couldn't use a double
>> linked list and still do FIFO and remove that silly linear list walking hack?
>
> This will obviously enlarge callback_head, and it is often embedded.
> But this is minor.
>
> If we use a double linked list we can't do task_work_add() lockless.
> So we will need another spinlock_t in task_struct. We can't use pi_lock.
You only need a singly linked list for FIFO, but indeed lockless
access is a pain.
For a LIFO stack, you just do a single compare-and-swap on the head.
Once an entry is in the list, it's immutable.
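In code, that LIFO scheme is roughly the following user-space sketch (C11 atomics stand in for the kernel's cmpxchg(); all names are invented for illustration, and the reverse() helper is the same loop that restores FIFO order in task_work_run()):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <assert.h>

/* User-space sketch; C11 atomics stand in for the kernel's cmpxchg(),
 * and all names are invented for illustration. */
struct lnode {
	struct lnode *next;
};

static _Atomic(struct lnode *) lifo_head;	/* like task->task_works */

static void lifo_push(struct lnode *n)		/* roughly task_work_add() */
{
	struct lnode *old = atomic_load(&lifo_head);
	do {
		n->next = old;	/* written before publishing; immutable after */
	} while (!atomic_compare_exchange_weak(&lifo_head, &old, n));
}

static struct lnode *lifo_pop_all(void)		/* roughly task_work_run() */
{
	return atomic_exchange(&lifo_head, (struct lnode *)NULL);
}

/* The "reverse it" loop that turns the popped LIFO batch into FIFO. */
static struct lnode *reverse(struct lnode *work)
{
	struct lnode *head = NULL, *next;

	while (work) {
		next = work->next;
		work->next = head;
		head = work;
		work = next;
	}
	return head;
}
```

The single compare-and-swap on the head is the entire synchronization; once pushed, a node's next pointer never changes until the reader has taken the whole batch.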
For FIFO, you only need one pointer in the nodes, but two in the list
head: a head pointer and a tail pointer.
The problem for lockless access is that you have to update both the next
pointer and the tail pointer, and without very specialized instructions
like 680x0's CAS2, there's no way to do them both atomically.
So the procedure to append (write) to the list is:
- Set your newly added node's next pointer to NULL.
- CAS the tail pointer. This establishes your place in line,
but does not make it visible to the reader.
- At this point, the next pointer is not yours any more and may be
written by someone else, but you may still finish filling in the
rest of the node, if you like.
- But also, there's a sort of priority inversion problem. If a writer
stalls here, no following writer is visible to the reader.
- Then you update the previous tail (obtained from the CAS) to link via its
next pointer to your newly added node. This makes it visible to the reader.
- The reader, meanwhile, totally ignores the tail pointer, and
only uses the head and next pointers.
Now, one trick with the above is that the tail pointer must *only*
be accessed by writers. The single-threaded technique of using a
"struct **" tail pointer and setting tail = &head to represent an
empty list is not okay.
Rather, the reader must never deallocate a node with a NULL next pointer.
If the nodes are large, you can minimize the overhead of this by having a
dedicated sentinel/dummy node which lives in the queue and gets moved to
the tail whenever it reaches the head *and* has a non-NULL next pointer.
But the reader may not depend on this node always being visible!
The reader might find the sentinel pointing to node A, and then move it to
the tail by doing a CAS on the tail pointer, but the tail pointer
by then points to B. It's possible that A's next pointer is still NULL
and the writer adding B has not gotten to the second step yet.
Anyway, it can be done, but honestly the current "empty the stack in
batches and then reverse it" is simpler.
The cond_resched is ugly, but the overhead is minimal, and the FIFO
guarantee is convenient.
By the way, it's quite possible (LIFO or FIFO) for the writer to batch up
queue adds and only do one CAS to add an arbitrary number.
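The whole scheme above, sentinel included, can be sketched in user space like this (it is essentially Dmitry Vyukov's intrusive MPSC queue; C11 atomics and invented names, not kernel code):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <assert.h>

/* User-space sketch of the two-pointer FIFO described above.
 * All names are invented for illustration. */
struct qnode {
	_Atomic(struct qnode *) next;
};

struct mpsc_queue {
	_Atomic(struct qnode *) tail;	/* touched by writers only */
	struct qnode *head;		/* touched by the one reader only */
	struct qnode stub;		/* sentinel that lives in the queue */
};

static void mpsc_init(struct mpsc_queue *q)
{
	atomic_store(&q->stub.next, (struct qnode *)NULL);
	atomic_store(&q->tail, &q->stub);
	q->head = &q->stub;
}

/* Writer: the two steps described above. */
static void mpsc_push(struct mpsc_queue *q, struct qnode *n)
{
	atomic_store_explicit(&n->next, (struct qnode *)NULL,
			      memory_order_relaxed);
	/* Step 1: claim a place in line (atomic, like xchg()). */
	struct qnode *prev = atomic_exchange(&q->tail, n);
	/* A writer stalled here is invisible to the reader, and so is
	 * everything queued after it -- the "priority inversion". */
	/* Step 2: publish by linking the old tail to us. */
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Reader: ignores the tail except to re-queue the sentinel, and never
 * hands out a node whose next pointer might still be written. */
static struct qnode *mpsc_pop(struct mpsc_queue *q)
{
	struct qnode *head = q->head;
	struct qnode *next = atomic_load_explicit(&head->next,
						  memory_order_acquire);

	if (head == &q->stub) {		/* skip over the sentinel */
		if (!next)
			return NULL;	/* empty */
		q->head = head = next;
		next = atomic_load_explicit(&head->next, memory_order_acquire);
	}
	if (next) {
		q->head = next;
		return head;
	}
	if (head != atomic_load(&q->tail))
		return NULL;		/* a writer is between steps 1 and 2 */
	/* head really is last: move the sentinel behind it, then retry */
	mpsc_push(q, &q->stub);
	next = atomic_load_explicit(&head->next, memory_order_acquire);
	if (!next)
		return NULL;		/* another writer mid-publish */
	q->head = next;
	return head;
}
```

Note that the reader returns NULL, rather than the head node, whenever it sees a NULL next pointer on a node that isn't provably last: that is exactly the "never deallocate a node with a NULL next pointer" rule and the stalled-writer case above. And per the batching remark, a writer could link several nodes together privately and publish the chain with a single exchange on the tail.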
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 21:08 [PATCH] task_work: remove fifo ordering guarantee George Spelvin
@ 2015-08-31 13:22 ` Oleg Nesterov
2015-08-31 15:21 ` George Spelvin
0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-31 13:22 UTC (permalink / raw)
To: George Spelvin; +Cc: eric.dumazet, linux-kernel, mingo
On 08/29, George Spelvin wrote:
>
> Oleg Nesterov wrote:
> > On 08/29, Ingo Molnar wrote:
> >> So I'm wondering, is there any strong reason why we couldn't use a double
> >> linked list and still do FIFO and remove that silly linear list walking hack?
> >
> > This will obviously enlarge callback_head, and it is often embedded.
> > But this is minor.
> >
> > If we use a double linked list we can't do task_work_add() lockless.
> > So we will need another spinlock_t in task_struct. We can't use pi_lock.
>
> You only need a singly linked list for FIFO, but indeed lockless
> access is a pain.
>
> For a LIFO stack, you just do a single compare-and-swap on the head.
> Once an entry is in the list, it's immutable.
>
> For FIFO, you only need one pointer in the nodes, but two in the list
> head: a head pointer and a tail pointer.
Actually you need a single tail pointer; see 158e1645e07f3e9f7e49.
But this doesn't matter.
> The problem for lockless access is that you have to update both the next
> pointer and the tail pointer, and without very specialized instructions
> like 680x0's CAS2, there's no way to do them both atomically.
>
> So the procedure to append (write) to the list is:
> ...
> - But also, there's a sort of priority inversion problem. If a writer
> stalls here, no following writer is visible to the reader.
And this also means that the next writer which does task_work_add() +
task_work_cancel() will be surprised. Worse, this means that work->func()
doesn't own its callback_head/container_of. The previous tail is visible
to task_work_run().
Perhaps I missed something. But to me this all looks too clever ;)
Personally I'd prefer to just add another spinlock_t.
But so far I hope we can keep this stupid but simple "reverse the list"
loop.
Oleg.
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-31 13:22 ` Oleg Nesterov
@ 2015-08-31 15:21 ` George Spelvin
0 siblings, 0 replies; 20+ messages in thread
From: George Spelvin @ 2015-08-31 15:21 UTC (permalink / raw)
To: linux, oleg; +Cc: eric.dumazet, linux-kernel, mingo
Oleg Nesterov wrote:
> Actually you need a single tail pointer, See 158e1645e07f3e9f7e49.
> But this doesn't matter.
(This uses a circularly linked list, keeping a pointer to the
tail, and tail->next pointing to the head.)
Yes, if you're not trying to be lockless, that works quite well.
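(A single-threaded sketch of that representation, with invented names — one tail pointer, tail->next being the head:)

```c
#include <stddef.h>
#include <assert.h>

/* Single-threaded sketch of the tail-only circular list (invented names). */
struct cnode {
	struct cnode *next;
};

struct ctail {
	struct cnode *tail;	/* NULL when empty; tail->next is the head */
};

static void ctail_add(struct ctail *q, struct cnode *n)	/* append at tail */
{
	if (!q->tail) {
		n->next = n;		/* sole element: points at itself */
	} else {
		n->next = q->tail->next;	/* new tail -> old head */
		q->tail->next = n;
	}
	q->tail = n;
}

static struct cnode *ctail_pop(struct ctail *q)		/* take from head */
{
	struct cnode *head;

	if (!q->tail)
		return NULL;
	head = q->tail->next;
	if (head == q->tail)
		q->tail = NULL;		/* that was the last one */
	else
		q->tail->next = head->next;
	return head;
}
```

Both append and pop are O(1) with a single pointer in the list head, which is why it only costs a lock, not extra space.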
> And this also means that the next writer which does task_work_add() +
> task_work_cancel() will be suprised. Worse, this means that work->func()
> doesn't own its callback_head/container_of. The previous tail is visible
> to task_work_run().
I forgot about task_work_cancel()! Yes, supporting that as well would
be a problem, and make a messy algorithm even messier.
> Perhaps I missed something. But to me this all looks too clever ;)
> Personally I'd prefer to just add another spinlock_t.
>
> But so far I hope we can keep this stupid but simple "reverse the list"
> loop.
It *is* too complicated. That's why I described it: to discourage people
from implementing it.
* [PATCH] task_work: remove fifo ordering guarantee
@ 2015-08-29 2:42 Eric Dumazet
2015-08-29 3:19 ` Linus Torvalds
2015-08-29 12:49 ` Oleg Nesterov
0 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2015-08-29 2:42 UTC (permalink / raw)
To: Al Viro, Linus Torvalds
Cc: linux-kernel@vger.kernel.org, Oleg Nesterov, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski
From: Eric Dumazet <edumazet@google.com>
In commit f341861fb0b ("task_work: add a scheduling point in
task_work_run()") I fixed a latency problem adding a cond_resched()
call.
Later, commit ac3d0da8f329 added yet another loop to reverse a list,
bringing back the latency spike:
I've seen in some cases this loop taking 275 ms, if for example a
process with 2,000,000 files is killed.
We could add yet another cond_resched() in the reverse loop, or we
can simply remove the reversal, as I do not think anything
would depend on order of task_work_add() submitted works.
Fixes: ac3d0da8f329 ("task_work: Make task_work_add() lockless")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
---
kernel/task_work.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 8727032e3a6f..53fa971d000d 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -18,6 +18,8 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
* This is like the signal handler which runs in kernel mode, but it doesn't
* try to wake up the @task.
*
+ * Note: there is no ordering guarantee on works queued here.
+ *
* RETURNS:
* 0 if succeeds or -ESRCH.
*/
@@ -108,16 +110,6 @@ void task_work_run(void)
raw_spin_unlock_wait(&task->pi_lock);
smp_mb();
- /* Reverse the list to run the works in fifo order */
- head = NULL;
- do {
- next = work->next;
- work->next = head;
- head = work;
- work = next;
- } while (work);
-
- work = head;
do {
next = work->next;
work->func(work);
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 2:42 Eric Dumazet
@ 2015-08-29 3:19 ` Linus Torvalds
2015-08-29 9:22 ` Ingo Molnar
2015-08-29 12:49 ` Oleg Nesterov
1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2015-08-29 3:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Al Viro, linux-kernel@vger.kernel.org, Oleg Nesterov,
Andrew Morton, Thomas Gleixner, Ingo Molnar,
Maciej Żenczykowski
On Fri, Aug 28, 2015 at 7:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> We could add yet another cond_resched() in the reverse loop, or we
> can simply remove the reversal, as I do not think anything
> would depend on order of task_work_add() submitted works.
So I think this should be ok, with things like file closing not really
caring about ordering as far as I can tell.
However, has anybody gone through all the task-work users? I looked
quickly at the task_work_add() cases, and didn't see anything that
looked like it would care, but others should look too. In the vfs,
there's the delayed fput and mnt freeing, and there's a keyring
installation one.
The threaded irq handlers use it as that exit-time hack, which
certainly shouldn't care, and there's some uprobe thing.
Can anybody see anything fishy?
Linus
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 3:19 ` Linus Torvalds
@ 2015-08-29 9:22 ` Ingo Molnar
2015-08-29 12:54 ` Oleg Nesterov
0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2015-08-29 9:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org,
Oleg Nesterov, Andrew Morton, Thomas Gleixner,
Maciej Żenczykowski
* Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, Aug 28, 2015 at 7:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > We could add yet another cond_resched() in the reverse loop, or we can simply
> > remove the reversal, as I do not think anything would depend on order of
> > task_work_add() submitted works.
>
> So I think this should be ok, with things like file closing not really caring
> about ordering as far as I can tell.
>
> However, has anybody gone through all the task-work users? I looked quickly at
> the task_work_add() cases, and didn't see anything that looked like it would
> care, but others should look too. In the vfs, there's the delayed fput and mnt
> freeing, and there's a keyring installation one.
>
> The threaded irq handlers use it as that exit-time hack, which certainly
> shouldn't care, and there's some uprobe thing.
>
> Can anybody see anything fishy?
So I'm wondering, is there any strong reason why we couldn't use a double linked
list and still do FIFO and remove that silly linear list walking hack?
Thanks,
Ingo
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 9:22 ` Ingo Molnar
@ 2015-08-29 12:54 ` Oleg Nesterov
2015-08-31 6:02 ` Ingo Molnar
0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-29 12:54 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Eric Dumazet, Al Viro,
linux-kernel@vger.kernel.org, Andrew Morton, Thomas Gleixner,
Maciej Żenczykowski
On 08/29, Ingo Molnar wrote:
>
> So I'm wondering, is there any strong reason why we couldn't use a double linked
> list and still do FIFO and remove that silly linear list walking hack?
This will obviously enlarge callback_head, and it is often embedded.
But this is minor.
If we use a double linked list we can't do task_work_add() lockless.
So we will need another spinlock_t in task_struct. We can't use pi_lock.
Oleg.
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 12:54 ` Oleg Nesterov
@ 2015-08-31 6:02 ` Ingo Molnar
2015-08-31 12:51 ` Oleg Nesterov
0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2015-08-31 6:02 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Eric Dumazet, Al Viro,
linux-kernel@vger.kernel.org, Andrew Morton, Thomas Gleixner,
Maciej Żenczykowski
* Oleg Nesterov <oleg@redhat.com> wrote:
> On 08/29, Ingo Molnar wrote:
> >
> > So I'm wondering, is there any strong reason why we couldn't use a double linked
> > list and still do FIFO and remove that silly linear list walking hack?
>
> This will obviously enlarge callback_head, and it is often embedded.
> But this is minor.
>
> If we use a double linked list we can't do task_work_add() lockless.
> So we will need another spinlock_t in task_struct. We can't use pi_lock.
The fact that the O(N) overhead was measured in real apps to be in the
milliseconds IMHO weakens cycle-level concerns about also having a spinlock next
to the list head. (There are no additional cacheline bouncing concerns with the
spinlock: the head of a LIFO list is essentially a bouncing cacheline.)
If there's some other solution, sure, but LIFO queues tend to be trouble down the
line.
Thanks,
Ingo
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-31 6:02 ` Ingo Molnar
@ 2015-08-31 12:51 ` Oleg Nesterov
0 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-31 12:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Eric Dumazet, Al Viro,
linux-kernel@vger.kernel.org, Andrew Morton, Thomas Gleixner,
Maciej Żenczykowski
On 08/31, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 08/29, Ingo Molnar wrote:
> > >
> > > So I'm wondering, is there any strong reason why we couldn't use a double linked
> > > list and still do FIFO and remove that silly linear list walking hack?
> >
> > This will obviously enlarge callback_head, and it is often embedded.
> > But this is minor.
> >
> > If we use a double linked list we can't do task_work_add() lockless.
> > So we will need another spinlock_t in task_struct. We can't use pi_lock.
>
> The fact that the O(N) overhead was measured in real apps to be in the
> milliseconds IMHO weakens cycle-level concerns about also having a spinlock next
> to the list head. (There's no additional cacheline bouncing concerns with the
> spinlock: the head of a LIFO list is essentially a bouncing cacheline.)
I agree. I just tried to explain that we need a few more changes than
just s/callback_head/list_head/ in task_struct.
And. The fact that this O(N) overhead was measured means that we have
more overhead from offload-fput-to-exit_task_work, which would be nice
to remove as well.
Oleg.
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 2:42 Eric Dumazet
2015-08-29 3:19 ` Linus Torvalds
@ 2015-08-29 12:49 ` Oleg Nesterov
2015-08-29 13:57 ` Eric Dumazet
2015-09-05 5:35 ` Al Viro
1 sibling, 2 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-29 12:49 UTC (permalink / raw)
To: Eric Dumazet
Cc: Al Viro, Linus Torvalds, linux-kernel@vger.kernel.org,
Andrew Morton, Thomas Gleixner, Ingo Molnar,
Maciej Żenczykowski
On 08/28, Eric Dumazet wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> In commit f341861fb0b ("task_work: add a scheduling point in
> task_work_run()") I fixed a latency problem adding a cond_resched()
> call.
>
> Later, commit ac3d0da8f329 added yet another loop to reverse a list,
> bringing back the latency spike :
>
> I've seen in some cases this loop taking 275 ms, if for example a
> process with 2,000,000 files is killed.
>
> We could add yet another cond_resched() in the reverse loop,
Can't we do this?
> or we
> can simply remove the reversal, as I do not think anything
> would depend on order of task_work_add() submitted works.
Personally I'd prefer to keep the fifo ordering. It just makes
more sense imho. Even if currently nobody depends on it (although
I am not sure about out-of-tree modules, say, systemtap).
Let's look at keyctl_session_to_parent(). It does task_work_cancel()
but only because we can not trust user-space. Otherwise we could
remove it and just do task_work_add(), but this needs fifo.
Fifo just looks more sane to me.
Oleg.
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 12:49 ` Oleg Nesterov
@ 2015-08-29 13:57 ` Eric Dumazet
2015-08-29 14:11 ` Eric Dumazet
2015-09-05 5:35 ` Al Viro
1 sibling, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2015-08-29 13:57 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Al Viro, Linus Torvalds, linux-kernel@vger.kernel.org,
Andrew Morton, Thomas Gleixner, Ingo Molnar,
Maciej Żenczykowski
On Sat, 2015-08-29 at 14:49 +0200, Oleg Nesterov wrote:
> On 08/28, Eric Dumazet wrote:
> >
> > From: Eric Dumazet <edumazet@google.com>
> >
> > In commit f341861fb0b ("task_work: add a scheduling point in
> > task_work_run()") I fixed a latency problem adding a cond_resched()
> > call.
> >
> > Later, commit ac3d0da8f329 added yet another loop to reverse a list,
> > bringing back the latency spike :
> >
> > I've seen in some cases this loop taking 275 ms, if for example a
> > process with 2,000,000 files is killed.
> >
> > We could add yet another cond_resched() in the reverse loop,
>
> Can't we do this?
Well, I stated in the changelog we could. Obviously we can.
Adding 275 ms of pure overhead to perform this list reversal for files
to be closed is quite unfortunate.
> Personally I'd prefer to keep the fifo ordering. It just makes
> more sense imho. Even if currently nobody depends on it (although
> I am not sure about out-of-tree modules, say, systemtap).
>
> Let's look at keyctl_session_to_parent(). It does task_work_cancel()
> but only because we can not trust user-space. Otherwise we could
> remove it and just do task_work_add(), but this needs fifo.
So it looks like there is no problem today, right, other than the
possibility of walking a long list while blocking IRQs?
>
> Fifo just looks more sane to me.
Well, files are closed in a random order, and they are the main user of
this stuff.
If this is that critical, maybe use 2 lists, one for stuff needing fifo,
and another one for un-ordered stuff (e.g. file closing), and add a
boolean to task_work_add()/task_work_cancel(). This adds yet another
field into struct task_struct.
Now we also could question why we needed commit
4a9d4b024a3102fc083c925c242d98ac27b1c5f6 ("switch fput to task_work_add
") since it seems quite an overhead at task exit with 10^6 files to
close.
I understood the 'schedule_work() for interrupt/kernel_thread callers'
part, but not the task_work_add() one.
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 13:57 ` Eric Dumazet
@ 2015-08-29 14:11 ` Eric Dumazet
2015-08-29 17:08 ` Linus Torvalds
0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2015-08-29 14:11 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Al Viro, Linus Torvalds, linux-kernel@vger.kernel.org,
Andrew Morton, Thomas Gleixner, Ingo Molnar,
Maciej Żenczykowski
On Sat, 2015-08-29 at 06:57 -0700, Eric Dumazet wrote:
> Now we also could question why we needed commit
> 4a9d4b024a3102fc083c925c242d98ac27b1c5f6 ("switch fput to task_work_add
> ") since it seems quite an overhead at task exit with 10^6 files to
> close.
>
> I understood the 'schedule_work() for interrupt/kernel_thread callers'
> part, but not the task_work_add() one.
If this needs to be kept, maybe then add the following, to make sure
we flush the list at most every BITS_PER_LONG files
diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..f3d0a79cef05 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -22,6 +22,7 @@
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/workqueue.h>
+#include <linux/task_work.h>
int sysctl_nr_open __read_mostly = 1024*1024;
int sysctl_nr_open_min = BITS_PER_LONG;
@@ -392,6 +393,7 @@ static struct fdtable *close_files(struct files_struct * files)
i++;
set >>= 1;
}
+ task_work_run();
}
return fdt;
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 14:11 ` Eric Dumazet
@ 2015-08-29 17:08 ` Linus Torvalds
2015-08-31 5:22 ` yalin wang
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Linus Torvalds @ 2015-08-29 17:08 UTC (permalink / raw)
To: Eric Dumazet
Cc: Oleg Nesterov, Al Viro, linux-kernel@vger.kernel.org,
Andrew Morton, Thomas Gleixner, Ingo Molnar,
Maciej Żenczykowski
On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> If this needs to be kept, maybe then add following, to make sure
> we flush the list at most every BITS_PER_LONG files
Hmm.
I'm wondering if we should just make close_files() (or maybe even
filp_close()) use a synchronous fput().
Iirc, the reason we delay fput() is that we had some nasty issues for
the generic fput case. It was called from interrupt context by the aio
code, and just in general there's a lot of nasty cases that can cause
the final fput to happen (so there are lockdep issues with the mmap
locks because the last fput can come from munmap etc).
Maybe I forget some detail - it's been several years by now - but I
think we could make the regular "close()" and "exit()" cases just use
the synchronous fput (it's called "__fput_sync()" and currently
explicitly limited to just kernel threads).
Al?
Because it feels all kinds of stupid to add things to the task-work
queue just to then remove it almost immediately again. And
close_files() is also called from various contexts. but the whole "put
the final 'files_struct' case is certainly not at all as special as
the 'put the final file'.
Linus
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 17:08 ` Linus Torvalds
@ 2015-08-31 5:22 ` yalin wang
2015-09-05 5:19 ` Al Viro
2015-08-31 12:44 ` Oleg Nesterov
2015-09-05 5:12 ` Al Viro
2 siblings, 1 reply; 20+ messages in thread
From: yalin wang @ 2015-08-31 5:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Oleg Nesterov, Al Viro,
linux-kernel@vger.kernel.org, Andrew Morton, Thomas Gleixner,
Ingo Molnar, Maciej Żenczykowski
> On Aug 30, 2015, at 01:08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> If this needs to be kept, maybe then add following, to make sure
>> we flush the list at most every BITS_PER_LONG files
>
> Hmm.
>
> I'm wondering if we should just make close_files() (or maybe even
> filp_close()) use a synchronous fput().
>
> Iirc, the reason we delay fput() is that we had some nasty issues for
> the generic fput case. It was called from interrupt context by the aio
> code, and just in general there's a lot of nasty cases that can cause
> the final fput to happen (so there are lockdep issues with the mmap
> locks because the last fput being from munmap etc).
>
> Maybe I forget some detail - it's been several years by now - but I
> think we could make the regular "close()" and "exit()" cases just use
> the synchronous fput (it's called "__fput_sync()" and currently
> explicitly limited to just kernel threads).
>
> Al?
>
> Because it feels all kinds of stupid to add things to the task-work
> queue just to then remove it almost immediately again. And
> close_files() is also called from various contexts. but the whole "put
> the final 'files_struct' case is certainly not at all as special as
> the 'put the final file'.
>
> Linus
why not provide an API like:
fput()
fput_nosync() ?
because the synchronous version is reasonable and safe most of the time,
letting the user select which version to use seems more feasible, no matter whether it is a kthread or not.
Thanks
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-31 5:22 ` yalin wang
@ 2015-09-05 5:19 ` Al Viro
0 siblings, 0 replies; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:19 UTC (permalink / raw)
To: yalin wang
Cc: Linus Torvalds, Eric Dumazet, Oleg Nesterov,
linux-kernel@vger.kernel.org, Andrew Morton, Thomas Gleixner,
Ingo Molnar, Maciej Żenczykowski
On Mon, Aug 31, 2015 at 01:22:26PM +0800, yalin wang wrote:
> why not provide an API like:
> fput()
> fput_nosync() ?
>
> because the synchronous version is reasonable and safe most of the time,
> letting the user select which version to use seems more feasible, no matter whether it is a kthread or not.
Synchronous version is *NOT* safe in a lot of situations, from "deep enough
in kernel stack" to "now a function seven levels out in call chain happens
to hold a mutex grabbed elsewhere inside a mutex taken by unexpected
->release() instance, causing a deadlock", etc.
It's not sync vs. async; we still guarantee execution before return from
syscall. The only case when we really get async is kernel threads - there
we do *not* return to userland at all, so we have to schedule it really
asynchronous. Which is why we need an explicit sync version (for kernel
threads only, not exported, don't use unless you really understand what
you are doing and can explain why that particular case is safe, etc.)
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 17:08 ` Linus Torvalds
2015-08-31 5:22 ` yalin wang
@ 2015-08-31 12:44 ` Oleg Nesterov
2015-09-05 5:12 ` Al Viro
2 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-31 12:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org,
Andrew Morton, Thomas Gleixner, Ingo Molnar,
Maciej Żenczykowski
On 08/29, Linus Torvalds wrote:
>
> On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > If this needs to be kept, maybe then add following, to make sure
> > we flush the list at most every BITS_PER_LONG files
>
> Hmm.
>
> I'm wondering if we should just make close_files() (or maybe even
> filp_close()) use a synchronous fput().
Heh. I thought about the same change. So perhaps it is even the right
thing to do. Still I am worried, because "it can't be that simple" ;)
And, with this change close_files() is called before exit_fs() and
exit_task_namespaces(). This is fine (iiuc), but this means that the
creative code in drivers/ can (wrongly) rely on this fact again. IIRC,
the change which moved __fput() into task_work_exit() uncovered some
interesting problems, like filp_open() called from fop->release().
Anyway, this is the question to Al, I guess.
Oleg.
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 17:08 ` Linus Torvalds
2015-08-31 5:22 ` yalin wang
2015-08-31 12:44 ` Oleg Nesterov
@ 2015-09-05 5:12 ` Al Viro
2015-09-05 5:42 ` Al Viro
2 siblings, 1 reply; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Oleg Nesterov, linux-kernel@vger.kernel.org,
Andrew Morton, Thomas Gleixner, Ingo Molnar,
Maciej Żenczykowski
On Sat, Aug 29, 2015 at 10:08:30AM -0700, Linus Torvalds wrote:
> Hmm.
>
> I'm wondering if we should just make close_files() (or maybe even
> filp_close()) use a synchronous fput().
>
> Iirc, the reason we delay fput() is that we had some nasty issues for
> the generic fput case. It was called from interrupt context by the aio
> code, and just in general there's a lot of nasty cases that can cause
> the final fput to happen (so there are lockdep issues with the mmap
> locks because the last fput being from munmap etc).
>
> Maybe I forget some detail - it's been several years by now - but I
> think we could make the regular "close()" and "exit()" cases just use
> the synchronous fput (it's called "__fput_sync()" and currently
> explicitly limited to just kernel threads).
>
> Al?
First of all, we'd better not count on e.g. delayed fput() *NOT* doing
task_work_add() - we still need to check if any new work had been added.
After all, final close() might very well have done a final mntput()
on a lazy-unmounted filesystem, possibly leaving us with fs shutdown via
task_work_add(). And if that sucker e.g. closes a socket, well, we are
back to closing an opened struct file, with task_work_add() etc.
I'm a bit nervous about filp_close() (that sucker is exported and widely
abused), but close_files()... sure, shouldn't be a problem. And yes,
we can teach __close_fd() to do the same. I really don't understand what's
the benefit, though - it's about the case when we are closing the last
descriptor for a given opened file, so I would be rather surprised if the
slower path taken on the way out to userland was not lost in the noise...
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-09-05 5:12 ` Al Viro
@ 2015-09-05 5:42 ` Al Viro
2015-09-05 20:46 ` Linus Torvalds
0 siblings, 1 reply; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Oleg Nesterov, linux-kernel@vger.kernel.org,
Andrew Morton, Thomas Gleixner, Ingo Molnar,
Maciej Żenczykowski
On Sat, Sep 05, 2015 at 06:12:34AM +0100, Al Viro wrote:
> First of all, we'd better not count on e.g. delayed fput() *NOT* doing
> task_work_add() - we still need to check if any new work had been added.
> After all, final close() might very well have done a final mntput()
> on a lazy-unmounted filesystem, possibly leaving us with fs shutdown via
> task_work_add(). And if that sucker e.g. closes a socket, well, we are
> back to closing an opened struct file, with task_work_add() etc.
>
> I'm a bit nervious about filp_close() (that sucker is exported and widely
> abused), but close_files()... sure, shouldn't be a problem. And yes,
> we can teach __close_fd() to do the same. I really don't understand what's
> the benefit, though - it's about the case when we are closing the last
> descriptor for given opened file, so I would be rather surprised if slower
> path taken on the way out to userland was not lost in noise...
OK, having found the beginning of the thread, I understand what is being
attempted, but... why the hell bother with FIFO in the first place? AFAICS,
task_work_add() uses in VFS (final fput() and final mntput() alike)
do not care about the FIFO at all.
Sure, some out-of-tree mer^H^Hodule might rely on that. So what?
IMO, unless we have a good in-tree reason for insisting on FIFO, dropping it
is the most obvious solution...
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-09-05 5:42 ` Al Viro
@ 2015-09-05 20:46 ` Linus Torvalds
0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2015-09-05 20:46 UTC (permalink / raw)
To: Al Viro
Cc: Eric Dumazet, Oleg Nesterov, linux-kernel@vger.kernel.org,
Andrew Morton, Thomas Gleixner, Ingo Molnar,
Maciej Żenczykowski
On Fri, Sep 4, 2015 at 10:42 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OK, having found the beginning of the thread, I understand what is being
> attempted, but... why the hell bother with FIFO in the first place? AFAICS,
> task_work_add() uses in VFS (final fput() and final mntput() alike)
> do not care about the FIFO at all.
>
> Sure, some out-of-tree mer^H^Hodule might rely on that. So what?
>
> IMO, unless we have a good in-tree reason for insisting on FIFO, dropping it
> is the most obvious solution...
I agree. We should just try that.
I'll apply Eric's patch from the beginning of this tree, and let's
just see if anybody ever notices.
Removing code and possibly fixing a latency issue sounds like a win-win.
Linus
* Re: [PATCH] task_work: remove fifo ordering guarantee
2015-08-29 12:49 ` Oleg Nesterov
2015-08-29 13:57 ` Eric Dumazet
@ 2015-09-05 5:35 ` Al Viro
1 sibling, 0 replies; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:35 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Eric Dumazet, Linus Torvalds, linux-kernel@vger.kernel.org,
Andrew Morton, Thomas Gleixner, Ingo Molnar,
Maciej Żenczykowski
On Sat, Aug 29, 2015 at 02:49:21PM +0200, Oleg Nesterov wrote:
> Let's look keyctl_session_to_parent(). It does task_work_cancel()
> but only because we can not trust user-space. Otherwise we could
> remove it and just do task_work_add(), but this needs fifo.
>
> Fifo just looks more sane to me.
Not if it costs us. As far as files closing is concerned, the order
really doesn't matter. Ditto for final mntput() uses of that stuff.
What *does* matter is task_work_add() issued by callback not getting lost.
IMO the obvious solution is to lose the reordering...