* [PATCH] task_work: remove fifo ordering guarantee
@ 2015-08-29 2:42 Eric Dumazet
2015-08-29 3:19 ` Linus Torvalds
2015-08-29 12:49 ` Oleg Nesterov
0 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2015-08-29 2:42 UTC (permalink / raw)
To: Al Viro, Linus Torvalds
Cc: linux-kernel@vger.kernel.org, Oleg Nesterov, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski
From: Eric Dumazet <edumazet@google.com>
In commit f341861fb0b ("task_work: add a scheduling point in
task_work_run()") I fixed a latency problem by adding a cond_resched()
call.

Later, commit ac3d0da8f329 added yet another loop to reverse a list,
bringing back the latency spike:
I've seen in some cases this loop taking 275 ms, if for example a
process with 2,000,000 files is killed.
We could add yet another cond_resched() in the reverse loop, or we
can simply remove the reversal, as I do not think anything
would depend on order of task_work_add() submitted works.
Fixes: ac3d0da8f329 ("task_work: Make task_work_add() lockless")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
---
kernel/task_work.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 8727032e3a6f..53fa971d000d 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -18,6 +18,8 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
* This is like the signal handler which runs in kernel mode, but it doesn't
* try to wake up the @task.
*
+ * Note: there is no ordering guarantee on works queued here.
+ *
* RETURNS:
* 0 if succeeds or -ESRCH.
*/
@@ -108,16 +110,6 @@ void task_work_run(void)
raw_spin_unlock_wait(&task->pi_lock);
smp_mb();
- /* Reverse the list to run the works in fifo order */
- head = NULL;
- do {
- next = work->next;
- work->next = head;
- head = work;
- work = next;
- } while (work);
-
- work = head;
do {
next = work->next;
work->func(work);
^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29  2:42 [PATCH] task_work: remove fifo ordering guarantee Eric Dumazet
@ 2015-08-29  3:19 ` Linus Torvalds
  2015-08-29  9:22   ` Ingo Molnar
  2015-08-29 12:49 ` Oleg Nesterov
  1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2015-08-29 3:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Al Viro, linux-kernel@vger.kernel.org, Oleg Nesterov, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Fri, Aug 28, 2015 at 7:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> We could add yet another cond_resched() in the reverse loop, or we
> can simply remove the reversal, as I do not think anything
> would depend on order of task_work_add() submitted works.

So I think this should be ok, with things like file closing not really
caring about ordering as far as I can tell.

However, has anybody gone through all the task-work users? I looked
quickly at the task_work_add() cases, and didn't see anything that
looked like it would care, but others should look too. In the vfs,
there's the delayed fput and mnt freeing, and there's a keyring
installation one.

The threaded irq handlers use it as that exit-time hack, which
certainly shouldn't care, and there's some uprobe thing.

Can anybody see anything fishy?

              Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29  3:19 ` Linus Torvalds
@ 2015-08-29  9:22   ` Ingo Molnar
  2015-08-29 12:54     ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2015-08-29 9:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org, Oleg Nesterov,
	Andrew Morton, Thomas Gleixner, Maciej Żenczykowski

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Aug 28, 2015 at 7:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > We could add yet another cond_resched() in the reverse loop, or we can simply
> > remove the reversal, as I do not think anything would depend on order of
> > task_work_add() submitted works.
>
> So I think this should be ok, with things like file closing not really caring
> about ordering as far as I can tell.
>
> However, has anybody gone through all the task-work users? I looked quickly at
> the task_work_add() cases, and didn't see anything that looked like it would
> care, but others should look too. In the vfs, there's the delayed fput and mnt
> freeing, and there's a keyring installation one.
>
> The threaded irq handlers use it as that exit-time hack, which certainly
> shouldn't care, and there's some uprobe thing.
>
> Can anybody see anything fishy?

So I'm wondering, is there any strong reason why we couldn't use a double linked
list and still do FIFO and remove that silly linear list walking hack?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29  9:22 ` Ingo Molnar
@ 2015-08-29 12:54   ` Oleg Nesterov
  2015-08-31  6:02     ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-29 12:54 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Maciej Żenczykowski

On 08/29, Ingo Molnar wrote:
>
> So I'm wondering, is there any strong reason why we couldn't use a double linked
> list and still do FIFO and remove that silly linear list walking hack?

This will obviously enlarge callback_head, and it is often embedded.
But this is minor.

If we use a double linked list we can't do task_work_add() lockless.
So we will need another spinlock_t in task_struct. We can't use pi_lock.

Oleg.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 12:54 ` Oleg Nesterov
@ 2015-08-31  6:02   ` Ingo Molnar
  2015-08-31 12:51     ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2015-08-31 6:02 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Maciej Żenczykowski

* Oleg Nesterov <oleg@redhat.com> wrote:

> On 08/29, Ingo Molnar wrote:
> >
> > So I'm wondering, is there any strong reason why we couldn't use a double linked
> > list and still do FIFO and remove that silly linear list walking hack?
>
> This will obviously enlarge callback_head, and it is often embedded.
> But this is minor.
>
> If we use a double linked list we can't do task_work_add() lockless.
> So we will need another spinlock_t in task_struct. We can't use pi_lock.

The fact that the O(N) overhead was measured in real apps to be in the
milliseconds IMHO weakens cycle-level concerns about also having a spinlock next
to the list head. (There are no additional cacheline-bouncing concerns with the
spinlock: the head of a LIFO list is essentially a bouncing cacheline.)

If there's some other solution, sure, but LIFO queues tend to be trouble down
the line.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-31  6:02 ` Ingo Molnar
@ 2015-08-31 12:51   ` Oleg Nesterov
  0 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-31 12:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Maciej Żenczykowski

On 08/31, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 08/29, Ingo Molnar wrote:
> > >
> > > So I'm wondering, is there any strong reason why we couldn't use a double linked
> > > list and still do FIFO and remove that silly linear list walking hack?
> >
> > This will obviously enlarge callback_head, and it is often embedded.
> > But this is minor.
> >
> > If we use a double linked list we can't do task_work_add() lockless.
> > So we will need another spinlock_t in task_struct. We can't use pi_lock.
>
> The fact that the O(N) overhead was measured in real apps to be in the
> milliseconds IMHO weakens cycle-level concerns about also having a spinlock next
> to the list head. (There's no additional cacheline bouncing concerns with the
> spinlock: the head of a LIFO list is essentially a bouncing cacheline.)

I agree. I just tried to explain that we need a bit more changes than
just s/callback_head/list_head/ in task_struct.

And. The fact that this O(N) overhead was measured means that we have
more overhead with offload-fput-to-exit_task_work which would be nice
to remove as well.

Oleg.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29  2:42 [PATCH] task_work: remove fifo ordering guarantee Eric Dumazet
  2015-08-29  3:19 ` Linus Torvalds
@ 2015-08-29 12:49 ` Oleg Nesterov
  2015-08-29 13:57   ` Eric Dumazet
  2015-09-05  5:35   ` [PATCH] task_work: remove fifo ordering guarantee Al Viro
  1 sibling, 2 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-29 12:49 UTC (permalink / raw)
To: Eric Dumazet
Cc: Al Viro, Linus Torvalds, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On 08/28, Eric Dumazet wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> In commit f341861fb0b ("task_work: add a scheduling point in
> task_work_run()") I fixed a latency problem adding a cond_resched()
> call.
>
> Later, commit ac3d0da8f329 added yet another loop to reverse a list,
> bringing back the latency spike :
>
> I've seen in some cases this loop taking 275 ms, if for example a
> process with 2,000,000 files is killed.
>
> We could add yet another cond_resched() in the reverse loop,

Can't we do this?

> or we
> can simply remove the reversal, as I do not think anything
> would depend on order of task_work_add() submitted works.

Personally I'd prefer to keep the fifo ordering. It just makes
more sense imho. Even if currently nobody depends on it (although
I am not sure about out-of-tree modules, say, systemtap).

Let's look at keyctl_session_to_parent(). It does task_work_cancel()
but only because we can not trust user-space. Otherwise we could
remove it and just do task_work_add(), but this needs fifo.

Fifo just looks more sane to me.

Oleg.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 12:49 ` Oleg Nesterov
@ 2015-08-29 13:57   ` Eric Dumazet
  2015-08-29 14:11     ` Eric Dumazet
  2015-08-31 12:05     ` change filp_close() to use __fput_sync() ? (Was: [PATCH] task_work: remove fifo ordering guarantee) Oleg Nesterov
  1 sibling, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2015-08-29 13:57 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Al Viro, Linus Torvalds, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, 2015-08-29 at 14:49 +0200, Oleg Nesterov wrote:
> On 08/28, Eric Dumazet wrote:
> >
> > From: Eric Dumazet <edumazet@google.com>
> >
> > In commit f341861fb0b ("task_work: add a scheduling point in
> > task_work_run()") I fixed a latency problem adding a cond_resched()
> > call.
> >
> > Later, commit ac3d0da8f329 added yet another loop to reverse a list,
> > bringing back the latency spike :
> >
> > I've seen in some cases this loop taking 275 ms, if for example a
> > process with 2,000,000 files is killed.
> >
> > We could add yet another cond_resched() in the reverse loop,
>
> Can't we do this?

Well, I stated in the changelog we could. Obviously we can.

Adding 275 ms of pure overhead to perform this list reversal for files
to be closed is quite unfortunate.

> Personally I'd prefer to keep the fifo ordering. It just makes
> more sense imho. Even if currently nobody depends on it (although
> I am not sure about out-of-tree modules, say, systemtap).
>
> Let's look keyctl_session_to_parent(). It does task_work_cancel()
> but only because we can not trust user-space. Otherwise we could
> remove it and just do task_work_add(), but this needs fifo.

So it looks like there is no problem today, right, other than the
possibility of parsing a long list while blocking IRQs?

> Fifo just looks more sane to me.

Well, files are closed in a random order. These are the main users of
this stuff.

If this is that critical, maybe use 2 lists, one for stuff needing
fifo, and another one for unordered stuff (e.g. file closing), and add
a boolean to task_work_add()/task_work_cancel(). This adds yet another
field into struct task_struct.

Now we also could question why we needed commit
4a9d4b024a3102fc083c925c242d98ac27b1c5f6 ("switch fput to
task_work_add") since it seems quite an overhead at task exit with
10^6 files to close.

I understood the 'schedule_work() for interrupt/kernel_thread callers'
part, but not the task_work_add() one.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 13:57 ` Eric Dumazet
@ 2015-08-29 14:11   ` Eric Dumazet
  2015-08-29 17:08     ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2015-08-29 14:11 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Al Viro, Linus Torvalds, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, 2015-08-29 at 06:57 -0700, Eric Dumazet wrote:

> Now we also could question why we needed commit
> 4a9d4b024a3102fc083c925c242d98ac27b1c5f6 ("switch fput to task_work_add
> ") since it seems quite an overhead at task exit with 10^6 of files to
> close.
>
> I understood the 'schedule_work() for interrupt/kernel_thread callers'
> part, but not the task_work_add() one.

If this needs to be kept, maybe then add following, to make sure
we flush the list at most every BITS_PER_LONG files

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..f3d0a79cef05 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/workqueue.h>
+#include <linux/task_work.h>
 
 int sysctl_nr_open __read_mostly = 1024*1024;
 int sysctl_nr_open_min = BITS_PER_LONG;
@@ -392,6 +393,7 @@ static struct fdtable *close_files(struct files_struct * files)
 				i++;
 				set >>= 1;
 			}
+			task_work_run();
 		}
 	}
 	return fdt;

^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 14:11 ` Eric Dumazet
@ 2015-08-29 17:08   ` Linus Torvalds
  2015-08-31  5:22     ` yalin wang
                        ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Linus Torvalds @ 2015-08-29 17:08 UTC (permalink / raw)
To: Eric Dumazet
Cc: Oleg Nesterov, Al Viro, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> If this needs to be kept, maybe then add following, to make sure
> we flush the list at most every BITS_PER_LONG files

Hmm.

I'm wondering if we should just make close_files() (or maybe even
filp_close()) use a synchronous fput().

Iirc, the reason we delay fput() is that we had some nasty issues for
the generic fput case. It was called from interrupt context by the aio
code, and just in general there are a lot of nasty cases that can cause
the final fput to happen (so there are lockdep issues with the mmap
locks because the last fput can come from munmap, etc).

Maybe I forget some detail - it's been several years by now - but I
think we could make the regular "close()" and "exit()" cases just use
the synchronous fput (it's called "__fput_sync()" and currently
explicitly limited to just kernel threads).

Al?

Because it feels all kinds of stupid to add things to the task-work
queue just to then remove it almost immediately again. And
close_files() is also called from various contexts. But the whole "put
the final 'files_struct'" case is certainly not at all as special as
the 'put the final file'.

              Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 17:08 ` Linus Torvalds
@ 2015-08-31  5:22   ` yalin wang
  2015-09-05  5:19     ` Al Viro
  1 sibling, 1 reply; 20+ messages in thread
From: yalin wang @ 2015-08-31 5:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Oleg Nesterov, Al Viro, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

> On Aug 30, 2015, at 01:08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> If this needs to be kept, maybe then add following, to make sure
>> we flush the list at most every BITS_PER_LONG files
>
> Hmm.
>
> I'm wondering if we should just make close_files() (or maybe even
> filp_close()) use a synchronous fput().
>
> Iirc, the reason we delay fput() is that we had some nasty issues for
> the generic fput case. It was called from interrupt context by the aio
> code, and just in general there's a lot of nasty cases that can cause
> the final fput to happen (so there are lockdep issues with the mmap
> locks because the last fput being from munmap etc).
>
> Maybe I forget some detail - it's been several years by now - but I
> think we could make the regular "close()" and "exit()" cases just use
> the synchronous fput (it's called "__fput_sync()" and currently
> explicitly limited to just kernel threads).
>
> Al?
>
> Because it feels all kinds of stupid to add things to the task-work
> queue just to then remove it almost immediately again. And
> close_files() is also called from various contexts. but the whole "put
> the final 'files_struct' case is certainly not at all as special as
> the 'put the final file'.
>
> Linus

Why not provide an API like:

	fput()
	fput_nosync()

The synchronous version is reasonable and safe most of the time, so
letting the caller select which version to use seems more flexible, no
matter whether it is a kthread or not.

Thanks

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-31  5:22 ` yalin wang
@ 2015-09-05  5:19   ` Al Viro
  0 siblings, 0 replies; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:19 UTC (permalink / raw)
To: yalin wang
Cc: Linus Torvalds, Eric Dumazet, Oleg Nesterov, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Mon, Aug 31, 2015 at 01:22:26PM +0800, yalin wang wrote:

> why not provide API like:
> fput()
> fput_nosync() ?
>
> because synchronous version are reasonable and safe in most time,
> let the user to select which version to use is more feasible, no matter if it is kthread or not.

Synchronous version is *NOT* safe in a lot of situations, from "deep
enough in kernel stack" to "now a function seven levels out in call
chain happens to hold a mutex grabbed elsewhere inside a mutex taken
by unexpected ->release() instance, causing a deadlock", etc.

It's not sync vs. async; we still guarantee execution before return
from syscall. The only case when we really get async is kernel
threads - there we do *not* return to userland at all, so we have to
schedule it really asynchronous. Which is why we need an explicit
sync version (for kernel threads only, not exported, don't use unless
you really understand what you are doing and can explain why that
particular case is safe, etc.)

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 17:08 ` Linus Torvalds
  2015-08-31  5:22 ` yalin wang
@ 2015-08-31 12:44 ` Oleg Nesterov
  2015-09-05  5:12 ` Al Viro
  2 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-31 12:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On 08/29, Linus Torvalds wrote:
>
> On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > If this needs to be kept, maybe then add following, to make sure
> > we flush the list at most every BITS_PER_LONG files
>
> Hmm.
>
> I'm wondering if we should just make close_files() (or maybe even
> filp_close()) use a synchronous fput().

Heh. I thought about the same change. So perhaps it is even the right
thing to do.

Still I am worried, because "it can't be that simple" ;)

And, with this change close_files() is called before exit_fs() and
exit_task_namespaces(). This is fine (iiuc), but this means that the
creative code in drivers/ can (wrongly) rely on this fact again.

IIRC, the change which moved __fput() into task_work_exit() uncovered
some interesting problems, like filp_open() called from
fop->release().

Anyway, this is the question to Al, I guess.

Oleg.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 17:08 ` Linus Torvalds
  2015-08-31  5:22 ` yalin wang
  2015-08-31 12:44 ` Oleg Nesterov
@ 2015-09-05  5:12 ` Al Viro
  2015-09-05  5:42   ` Al Viro
  2 siblings, 1 reply; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Oleg Nesterov, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, Aug 29, 2015 at 10:08:30AM -0700, Linus Torvalds wrote:

> Hmm.
>
> I'm wondering if we should just make close_files() (or maybe even
> filp_close()) use a synchronous fput().
>
> Iirc, the reason we delay fput() is that we had some nasty issues for
> the generic fput case. It was called from interrupt context by the aio
> code, and just in general there's a lot of nasty cases that can cause
> the final fput to happen (so there are lockdep issues with the mmap
> locks because the last fput being from munmap etc).
>
> Maybe I forget some detail - it's been several years by now - but I
> think we could make the regular "close()" and "exit()" cases just use
> the synchronous fput (it's called "__fput_sync()" and currently
> explicitly limited to just kernel threads).
>
> Al?

First of all, we'd better not count on e.g. delayed fput() *NOT* doing
task_work_add() - we still need to check if any new work had been added.
After all, final close() might very well have done a final mntput()
on a lazy-unmounted filesystem, possibly leaving us with fs shutdown via
task_work_add(). And if that sucker e.g. closes a socket, well, we are
back to closing an opened struct file, with task_work_add() etc.

I'm a bit nervous about filp_close() (that sucker is exported and widely
abused), but close_files()... sure, shouldn't be a problem. And yes,
we can teach __close_fd() to do the same. I really don't understand
what's the benefit, though - it's about the case when we are closing
the last descriptor for a given opened file, so I would be rather
surprised if the slower path taken on the way out to userland was not
lost in noise...

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-09-05  5:12 ` Al Viro
@ 2015-09-05  5:42   ` Al Viro
  2015-09-05 20:46     ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Oleg Nesterov, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, Sep 05, 2015 at 06:12:34AM +0100, Al Viro wrote:

> First of all, we'd better not count on e.g. delayed fput() *NOT* doing
> task_work_add() - we still need to check if any new work had been added.
> After all, final close() might very well have done a final mntput()
> on a lazy-unmounted filesystem, possibly leaving us with fs shutdown via
> task_work_add(). And if that sucker e.g. closes a socket, well, we are
> back to closing an opened struct file, with task_work_add() etc.
>
> I'm a bit nervious about filp_close() (that sucker is exported and widely
> abused), but close_files()... sure, shouldn't be a problem. And yes,
> we can teach __close_fd() to do the same. I really don't understand what's
> the benefit, though - it's about the case when we are closing the last
> descriptor for given opened file, so I would be rather surprised if slower
> path taken on the way out to userland was not lost in noise...

OK, having found the beginning of the thread, I understand what is being
attempted, but... why the hell bother with FIFO in the first place? AFAICS,
task_work_add() uses in VFS (final fput() and final mntput() alike)
do not care about the FIFO at all.

Sure, some out-of-tree mer^H^Hodule might rely on that. So what?

IMO, unless we have a good in-tree reason for insisting on FIFO, dropping it
is the most obvious solution...

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-09-05  5:42 ` Al Viro
@ 2015-09-05 20:46   ` Linus Torvalds
  0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2015-09-05 20:46 UTC (permalink / raw)
To: Al Viro
Cc: Eric Dumazet, Oleg Nesterov, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Fri, Sep 4, 2015 at 10:42 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OK, having found the beginning of the thread, I understand what is being
> attempted, but... why the hell bother with FIFO in the first place? AFAICS,
> task_work_add() uses in VFS (final fput() and final mntput() alike)
> do not care about the FIFO at all.
>
> Sure, some out-of-tree mer^H^Hodule might rely on that. So what?
>
> IMO, unless we have a good in-tree reason for insisting on FIFO, dropping it
> is the most obvious solution...

I agree. We should just try that.

I'll apply Eric's patch from the beginning of this thread, and let's just
see if anybody ever notices. Removing code and possibly fixing a latency
issue sounds like a win-win.

              Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread
* change filp_close() to use __fput_sync() ? (Was: [PATCH] task_work: remove fifo ordering guarantee)
  2015-08-29 13:57 ` Eric Dumazet
  2015-08-29 14:11 ` Eric Dumazet
@ 2015-08-31 12:05 ` Oleg Nesterov
  1 sibling, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-31 12:05 UTC (permalink / raw)
To: Eric Dumazet
Cc: Al Viro, Linus Torvalds, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On 08/29, Eric Dumazet wrote:
>
> On Sat, 2015-08-29 at 14:49 +0200, Oleg Nesterov wrote:
> > On 08/28, Eric Dumazet wrote:
> > >
> > > From: Eric Dumazet <edumazet@google.com>
> > >
> > > In commit f341861fb0b ("task_work: add a scheduling point in
> > > task_work_run()") I fixed a latency problem adding a cond_resched()
> > > call.
> > >
> > > Later, commit ac3d0da8f329 added yet another loop to reverse a list,
> > > bringing back the latency spike :
> > >
> > > I've seen in some cases this loop taking 275 ms, if for example a
> > > process with 2,000,000 files is killed.
> > >
> > > We could add yet another cond_resched() in the reverse loop,
> >
> > Can't we do this?
>
> Well, I stated in the changelog we could. Obviously we can.
>
> Adding 275 ms of pure overhead to perform this list reversal for files
> to be closed is quite unfortunate.

Well, if the first loop takes 275 ms, then probably the next one which
actually does a lot of __fput's takes much, much more time, so perhaps
these 275 ms are not very noticeable. Ignoring the latency problem due
to the lack of cond_resched, I mean.

But of course, this is not good, I agree. Please see below.

> > Fifo just looks more sane to me.
>
> Well, files are closed in a random order. These are the main user of
> this stuff.

This is the most "heavy" user. But task_works is the generic API.

> Now we also could question why we needed commit
> 4a9d4b024a3102fc083c925c242d98ac27b1c5f6 ("switch fput to task_work_add
> ") since it seems quite an overhead at task exit with 10^6 of files to
> close.

How about the patch below? I didn't try to test it yet, but since
filp_close() does ->flush() I think __fput_sync() should be safe here.

Al, what do you think?

Oleg.
---

--- x/fs/file_table.c
+++ x/fs/file_table.c
@@ -292,11 +292,8 @@ void fput(struct file *file)
  */
 void __fput_sync(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
-		struct task_struct *task = current;
-		BUG_ON(!(task->flags & PF_KTHREAD));
+	if (atomic_long_dec_and_test(&file->f_count))
 		__fput(file);
-	}
 }
 
 EXPORT_SYMBOL(fput);
--- x/fs/open.c
+++ x/fs/open.c
@@ -1074,7 +1074,7 @@ int filp_close(struct file *filp, fl_owner_t id)
 		dnotify_flush(filp, id);
 		locks_remove_posix(filp, id);
 	}
-	fput(filp);
+	__fput_sync(filp);
 	return retval;
 }

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 12:49 ` Oleg Nesterov
  2015-08-29 13:57 ` Eric Dumazet
@ 2015-09-05  5:35 ` Al Viro
  2015-09-07 12:27   ` [PATCH?] fput: don't abuse task_work_add() too much Oleg Nesterov
  1 sibling, 1 reply; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:35 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Eric Dumazet, Linus Torvalds, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, Aug 29, 2015 at 02:49:21PM +0200, Oleg Nesterov wrote:

> Let's look keyctl_session_to_parent(). It does task_work_cancel()
> but only because we can not trust user-space. Otherwise we could
> remove it and just do task_work_add(), but this needs fifo.
>
> Fifo just looks more sane to me.

Not if it costs us. As far as files closing is concerned, the order
really doesn't matter. Ditto for final mntput() uses of that stuff.

What *does* matter is task_work_add() issued by callback not getting
lost. IMO the obvious solution is to lose the reordering...

^ permalink raw reply	[flat|nested] 20+ messages in thread
* [PATCH?] fput: don't abuse task_work_add() too much
  2015-09-05  5:35 ` [PATCH] task_work: remove fifo ordering guarantee Al Viro
@ 2015-09-07 12:27   ` Oleg Nesterov
  2015-09-07 13:49     ` [PATCH? v2] " Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2015-09-07 12:27 UTC (permalink / raw)
To: Al Viro, Linus Torvalds
Cc: Eric Dumazet, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On 09/05, Al Viro wrote:
>
> On Sat, Aug 29, 2015 at 02:49:21PM +0200, Oleg Nesterov wrote:
>
> > Let's look keyctl_session_to_parent(). It does task_work_cancel()
> > but only because we can not trust user-space. Otherwise we could
> > remove it and just do task_work_add(), but this needs fifo.
> >
> > Fifo just looks more sane to me.
>
> Not if it costs us.

OK, Eric reports this trivial lockless loop takes 275ms. Apparently
the list of ____fput() works is huge. Now, how much does it cost to
_create_ this huge list using task_work_add() which does cmpxchg() ?
I guess a bit more. This is what should be fixed if we think this
hurts performance-wise.

And again, ignoring the latency problem due to the lack of
cond_resched, I am wondering if these 275ms are actually noticeable
compared to the time the next loop needs to call all these ____fput's.

> As far as files closing is concerned, the order
> really doesn't matter. Ditto for final mntput() uses of that stuff.

Sure, fput/mntput do not care about the ordering. And more, they do
not even really need task_work_add(). We can just remove it from fput()
and everything will work fine. Correctness-wise, I mean.

And yes, unfortunately we do not have in-kernel users which already
rely on fifo. But task_work_add() is the generic API, losing the
ordering makes it less useful or at least less convenient.

Just for (yes sure, artificial) example. Suppose we want to implement
sys_change_process_flags(int pid, uint set, uint clr). Only current
can change its ->flags, so we can use task_work_add() to do this. But
obviously only if it is fifo.

fput() differs because it does not care which process actually does
__fput(). And thus imo we should not count this user if we need to
decide do we need fifo or not.

> IMO the obvious solution is to lose the reordering...

Oh, I disagree. But I guess I can't convince you/Eric/Linus, so I have
to shut up.

Damn. But I can't relax ;) Al, Linus, could you comment the patch below?

Not for inclusion, lacks the changelog/testing, fput() can be simplified.
But as you can see it is simple. With this patch task_work_add(____fput)
will be called only once by (say) do_exit() path. ->fput_list does not
need any serialization / atomic ops / etc. Probably we also need to move
cond_resched() from task_work_run() to ____fput() after this patch.

Again, it is not that I think this actually makes sense, but since you
dislike these 275ms...

What do you think?

Oleg.
---

diff --git a/fs/file_table.c b/fs/file_table.c
index 294174d..36af701 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -241,7 +241,13 @@ static void delayed_fput(struct work_struct *unused)
 
 static void ____fput(struct callback_head *work)
 {
-	__fput(container_of(work, struct file, f_u.fu_rcuhead));
+	struct task_struct *task = current;
+	do {
+		struct file *file = task->fput_list;
+		task->fput_list = file->f_u.fu_next;
+		__fput(file);
+
+	} while (task->fput_list);
 }
 
 /*
@@ -267,9 +273,19 @@ void fput(struct file *file)
 		struct task_struct *task = current;
 
 		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+			if (task->fput_list) {
+				/* task_work_add() below was already called */
+				file->f_u.fu_next = task->fput_list;
+				task->fput_list = file;
+				return;
+			}
+
 			init_task_work(&file->f_u.fu_rcuhead, ____fput);
-			if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
+			if (!task_work_add(task, &file->f_u.fu_rcuhead, true)) {
+				file->f_u.fu_next = NULL;
+				task->fput_list = file;
 				return;
+			}
 			/*
 			 * After this task has run exit_task_work(),
 			 * task_work_add() will fail. Fall through to delayed
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0774487..73fe16c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -830,6 +830,7 @@ struct file {
 	union {
 		struct llist_node	fu_llist;
 		struct rcu_head		fu_rcuhead;
+		struct file		*fu_next;
 	} f_u;
 	struct path		f_path;
 	struct inode		*f_inode;	/* cached value */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f192cfe..6f704ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1477,6 +1477,7 @@ struct task_struct {
 	struct fs_struct *fs;
 /* open file information */
 	struct files_struct *files;
+	struct file *fput_list;
 /* namespaces */
 	struct nsproxy *nsproxy;
 /* signal handlers */
diff --git a/kernel/fork.c b/kernel/fork.c
index 03c1eaa..77c0a50 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1007,6 +1007,7 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 	struct files_struct *oldf, *newf;
 	int error = 0;
 
+	tsk->fput_list = NULL;
 	/*
 	 * A background process may not have any files ...
 	 */

^ permalink raw reply related	[flat|nested] 20+ messages in thread
* [PATCH? v2] fput: don't abuse task_work_add() too much
  2015-09-07 12:27 ` [PATCH?] fput: don't abuse task_work_add() too much Oleg Nesterov
@ 2015-09-07 13:49   ` Oleg Nesterov
  0 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-09-07 13:49 UTC (permalink / raw)
To: Al Viro, Linus Torvalds
Cc: Eric Dumazet, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On 09/07, Oleg Nesterov wrote:
>
> Oh, I disagree. But I guess I can't convince you/Eric/Linus, so I have
> to shut up.
>
> Damn. But I can't relax ;) Al, Linus, could you comment the patch below?
>
> Not for inclusion, lacks the changelog/testing, fput() can be simplified.
> But as you can see it is simple. With this patch task_work_add(____fput)
> will be called only once by (say) do_exit() path. ->fput_list does not
> need any serialization / atomic ops / etc. Probably we also need to move
> cond_resched() from task_work_run() to ____fput() after this patch.
>
> Again, it is not that I think this actually makes sense, but since you
> dislike these 275ms...
>
> What do you think?

Yes, task_struct->fput_list is ugly. We can avoid it, but then we need
another ->next pointer in struct file. Perhaps we can reuse ->f_version?

This way the change looks really simple and not too bad to me. Although
I am not sure you will agree.

Oleg.
---

diff --git a/fs/file_table.c b/fs/file_table.c
index 294174d..c34b666 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -241,7 +241,15 @@ static void delayed_fput(struct work_struct *unused)
 
 static void ____fput(struct callback_head *work)
 {
-	__fput(container_of(work, struct file, f_u.fu_rcuhead));
+	struct file *file = container_of(work, struct file, f_u.fu_rcuhead);
+	struct file *next;
+
+	do {
+		next = file->f_next_put;
+		__fput(file);
+		file = next;
+
+	} while (file);
 }
 
 /*
@@ -267,9 +275,21 @@ void fput(struct file *file)
 		struct task_struct *task = current;
 
 		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+			struct callback_head *work = READ_ONCE(task->task_works);
+			struct file *prev;
+
+			if (work && work->func == ____fput) {
+				prev = container_of(work, struct file, f_u.fu_rcuhead);
+				file->f_next_put = prev->f_next_put;
+				prev->f_next_put = file;
+				return;
+			}
+
 			init_task_work(&file->f_u.fu_rcuhead, ____fput);
-			if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
+			if (!task_work_add(task, &file->f_u.fu_rcuhead, true)) {
+				file->f_next_put = NULL;
 				return;
+			}
 			/*
 			 * After this task has run exit_task_work(),
 			 * task_work_add() will fail. Fall through to delayed
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0774487..9381527 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -849,7 +849,10 @@ struct file {
 	const struct cred	*f_cred;
 	struct file_ra_state	f_ra;
 
-	u64			f_version;
+	union {
+		u64		f_version;
+		struct file	*f_next_put;
+	};
 #ifdef CONFIG_SECURITY
 	void			*f_security;
 #endif

^ permalink raw reply related	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2015-09-07 13:52 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-29  2:42 [PATCH] task_work: remove fifo ordering guarantee Eric Dumazet
2015-08-29  3:19 ` Linus Torvalds
2015-08-29  9:22   ` Ingo Molnar
2015-08-29 12:54     ` Oleg Nesterov
2015-08-31  6:02       ` Ingo Molnar
2015-08-31 12:51         ` Oleg Nesterov
2015-08-29 12:49 ` Oleg Nesterov
2015-08-29 13:57   ` Eric Dumazet
2015-08-29 14:11     ` Eric Dumazet
2015-08-29 17:08       ` Linus Torvalds
2015-08-31  5:22         ` yalin wang
2015-09-05  5:19           ` Al Viro
2015-08-31 12:44       ` Oleg Nesterov
2015-09-05  5:12         ` Al Viro
2015-09-05  5:42           ` Al Viro
2015-09-05 20:46             ` Linus Torvalds
2015-08-31 12:05   ` change filp_close() to use __fput_sync() ? (Was: [PATCH] task_work: remove fifo ordering guarantee) Oleg Nesterov
2015-09-05  5:35   ` [PATCH] task_work: remove fifo ordering guarantee Al Viro
2015-09-07 12:27     ` [PATCH?] fput: don't abuse task_work_add() too much Oleg Nesterov
2015-09-07 13:49       ` [PATCH? v2] " Oleg Nesterov