* [PATCH] task_work: remove fifo ordering guarantee
@ 2015-08-29 2:42 Eric Dumazet
2015-08-29 3:19 ` Linus Torvalds
2015-08-29 12:49 ` Oleg Nesterov
0 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2015-08-29 2:42 UTC (permalink / raw)
To: Al Viro, Linus Torvalds
Cc: linux-kernel@vger.kernel.org, Oleg Nesterov, Andrew Morton,
Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski
From: Eric Dumazet <edumazet@google.com>
In commit f341861fb0b ("task_work: add a scheduling point in
task_work_run()") I fixed a latency problem by adding a cond_resched()
call.

Later, commit ac3d0da8f329 added yet another loop to reverse a list,
bringing back the latency spike:
I've seen in some cases this loop taking 275 ms, if for example a
process with 2,000,000 files is killed.
We could add yet another cond_resched() in the reverse loop, or we
can simply remove the reversal, as I do not think anything
would depend on order of task_work_add() submitted works.
Fixes: ac3d0da8f329 ("task_work: Make task_work_add() lockless")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
---
kernel/task_work.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 8727032e3a6f..53fa971d000d 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -18,6 +18,8 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
* This is like the signal handler which runs in kernel mode, but it doesn't
* try to wake up the @task.
*
+ * Note: there is no ordering guarantee on works queued here.
+ *
* RETURNS:
* 0 if succeeds or -ESRCH.
*/
@@ -108,16 +110,6 @@ void task_work_run(void)
raw_spin_unlock_wait(&task->pi_lock);
smp_mb();
- /* Reverse the list to run the works in fifo order */
- head = NULL;
- do {
- next = work->next;
- work->next = head;
- head = work;
- work = next;
- } while (work);
-
- work = head;
do {
next = work->next;
work->func(work);
^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29  2:42 [PATCH] task_work: remove fifo ordering guarantee Eric Dumazet
@ 2015-08-29  3:19 ` Linus Torvalds
  2015-08-29  9:22   ` Ingo Molnar
  2015-08-29 12:49 ` Oleg Nesterov
  1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2015-08-29 3:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Al Viro, linux-kernel@vger.kernel.org, Oleg Nesterov, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Fri, Aug 28, 2015 at 7:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> We could add yet another cond_resched() in the reverse loop, or we
> can simply remove the reversal, as I do not think anything
> would depend on order of task_work_add() submitted works.

So I think this should be ok, with things like file closing not really
caring about ordering as far as I can tell.

However, has anybody gone through all the task-work users? I looked
quickly at the task_work_add() cases, and didn't see anything that
looked like it would care, but others should look too. In the vfs,
there's the delayed fput and mnt freeing, and there's a keyring
installation one.

The threaded irq handlers use it as that exit-time hack, which
certainly shouldn't care, and there's some uprobe thing.

Can anybody see anything fishy?

              Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29  3:19 ` Linus Torvalds
@ 2015-08-29  9:22   ` Ingo Molnar
  2015-08-29 12:54     ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2015-08-29 9:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org, Oleg Nesterov,
	Andrew Morton, Thomas Gleixner, Maciej Żenczykowski

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Aug 28, 2015 at 7:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > We could add yet another cond_resched() in the reverse loop, or we can simply
> > remove the reversal, as I do not think anything would depend on order of
> > task_work_add() submitted works.
>
> So I think this should be ok, with things like file closing not really caring
> about ordering as far as I can tell.
>
> However, has anybody gone through all the task-work users? I looked quickly at
> the task_work_add() cases, and didn't see anything that looked like it would
> care, but others should look too. In the vfs, there's the delayed fput and mnt
> freeing, and there's a keyring installation one.
>
> The threaded irq handlers use it as that exit-time hack, which certainly
> shouldn't care, and there's some uprobe thing.
>
> Can anybody see anything fishy?

So I'm wondering, is there any strong reason why we couldn't use a double linked
list and still do FIFO and remove that silly linear list walking hack?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29  9:22 ` Ingo Molnar
@ 2015-08-29 12:54   ` Oleg Nesterov
  2015-08-31  6:02     ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-29 12:54 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Maciej Żenczykowski

On 08/29, Ingo Molnar wrote:
>
> So I'm wondering, is there any strong reason why we couldn't use a double linked
> list and still do FIFO and remove that silly linear list walking hack?

This will obviously enlarge callback_head, and it is often embedded.
But this is minor.

If we use a double linked list we can't do task_work_add() lockless.
So we will need another spinlock_t in task_struct. We can't use pi_lock.

Oleg.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 12:54 ` Oleg Nesterov
@ 2015-08-31  6:02   ` Ingo Molnar
  2015-08-31 12:51     ` Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2015-08-31 6:02 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Maciej Żenczykowski

* Oleg Nesterov <oleg@redhat.com> wrote:

> On 08/29, Ingo Molnar wrote:
> >
> > So I'm wondering, is there any strong reason why we couldn't use a double linked
> > list and still do FIFO and remove that silly linear list walking hack?
>
> This will obviously enlarge callback_head, and it is often embedded.
> But this is minor.
>
> If we use a double linked list we can't do task_work_add() lockless.
> So we will need another spinlock_t in task_struct. We can't use pi_lock.

The fact that the O(N) overhead was measured in real apps to be in the
milliseconds IMHO weakens cycle-level concerns about also having a spinlock next
to the list head. (There are no additional cacheline-bouncing concerns with the
spinlock: the head of a LIFO list is essentially a bouncing cacheline.)

If there's some other solution, sure, but LIFO queues tend to be trouble down
the line.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-31  6:02 ` Ingo Molnar
@ 2015-08-31 12:51   ` Oleg Nesterov
  0 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-31 12:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Maciej Żenczykowski

On 08/31, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 08/29, Ingo Molnar wrote:
> > >
> > > So I'm wondering, is there any strong reason why we couldn't use a double linked
> > > list and still do FIFO and remove that silly linear list walking hack?
> >
> > This will obviously enlarge callback_head, and it is often embedded.
> > But this is minor.
> >
> > If we use a double linked list we can't do task_work_add() lockless.
> > So we will need another spinlock_t in task_struct. We can't use pi_lock.
>
> The fact that the O(N) overhead was measured in real apps to be in the
> milliseconds IMHO weakens cycle-level concerns about also having a spinlock next
> to the list head. (There's no additional cacheline bouncing concerns with the
> spinlock: the head of a LIFO list is essentially a bouncing cacheline.)

I agree. I just tried to explain that we need a bit more changes than
just s/callback_head/list_head/ in task_struct.

And. The fact that this O(N) overhead was measured means that we have
more overhead with offload-fput-to-exit_task_work which would be nice
to remove as well.

Oleg.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29  2:42 [PATCH] task_work: remove fifo ordering guarantee Eric Dumazet
  2015-08-29  3:19 ` Linus Torvalds
@ 2015-08-29 12:49 ` Oleg Nesterov
  2015-08-29 13:57   ` Eric Dumazet
  2015-09-05  5:35   ` [PATCH] task_work: remove fifo ordering guarantee Al Viro
  1 sibling, 2 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-29 12:49 UTC (permalink / raw)
To: Eric Dumazet
Cc: Al Viro, Linus Torvalds, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On 08/28, Eric Dumazet wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> In commit f341861fb0b ("task_work: add a scheduling point in
> task_work_run()") I fixed a latency problem adding a cond_resched()
> call.
>
> Later, commit ac3d0da8f329 added yet another loop to reverse a list,
> bringing back the latency spike :
>
> I've seen in some cases this loop taking 275 ms, if for example a
> process with 2,000,000 files is killed.
>
> We could add yet another cond_resched() in the reverse loop,

Can't we do this?

> or we
> can simply remove the reversal, as I do not think anything
> would depend on order of task_work_add() submitted works.

Personally I'd prefer to keep the fifo ordering. It just makes
more sense imho. Even if currently nobody depends on it (although
I am not sure about out-of-tree modules, say, systemtap).

Let's look at keyctl_session_to_parent(). It does task_work_cancel()
but only because we can not trust user-space. Otherwise we could
remove it and just do task_work_add(), but this needs fifo.

Fifo just looks more sane to me.

Oleg.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 12:49 ` Oleg Nesterov
@ 2015-08-29 13:57   ` Eric Dumazet
  2015-08-29 14:11     ` Eric Dumazet
  2015-08-31 12:05     ` change filp_close() to use __fput_sync() ? (Was: [PATCH] task_work: remove fifo ordering guarantee) Oleg Nesterov
  1 sibling, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2015-08-29 13:57 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Al Viro, Linus Torvalds, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, 2015-08-29 at 14:49 +0200, Oleg Nesterov wrote:
> On 08/28, Eric Dumazet wrote:
> >
> > From: Eric Dumazet <edumazet@google.com>
> >
> > In commit f341861fb0b ("task_work: add a scheduling point in
> > task_work_run()") I fixed a latency problem adding a cond_resched()
> > call.
> >
> > Later, commit ac3d0da8f329 added yet another loop to reverse a list,
> > bringing back the latency spike :
> >
> > I've seen in some cases this loop taking 275 ms, if for example a
> > process with 2,000,000 files is killed.
> >
> > We could add yet another cond_resched() in the reverse loop,
>
> Can't we do this?

Well, I stated in the changelog we could. Obviously we can.

Adding 275 ms of pure overhead to perform this list reversal for files
to be closed is quite unfortunate.

> Personally I'd prefer to keep the fifo ordering. It just makes
> more sense imho. Even if currently nobody depends on it (although
> I am not sure about out-of-tree modules, say, systemtap).
>
> Let's look keyctl_session_to_parent(). It does task_work_cancel()
> but only because we can not trust user-space. Otherwise we could
> remove it and just do task_work_add(), but this needs fifo.

So it looks like there is no problem today, right, other than the
possibility of parsing a long list while blocking IRQs?

> Fifo just looks more sane to me.

Well, files are closed in a random order. These are the main users of
this stuff.

If this is that critical, maybe use 2 lists, one for stuff needing
fifo, and another one for unordered stuff (e.g. file closing), and add
a boolean to task_work_add()/task_work_cancel(). This adds yet another
field into struct task_struct.

Now we also could question why we needed commit
4a9d4b024a3102fc083c925c242d98ac27b1c5f6 ("switch fput to
task_work_add") since it seems quite an overhead at task exit with
10^6 files to close.

I understood the 'schedule_work() for interrupt/kernel_thread callers'
part, but not the task_work_add() one.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 13:57 ` Eric Dumazet
@ 2015-08-29 14:11   ` Eric Dumazet
  2015-08-29 17:08     ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2015-08-29 14:11 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Al Viro, Linus Torvalds, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, 2015-08-29 at 06:57 -0700, Eric Dumazet wrote:

> Now we also could question why we needed commit
> 4a9d4b024a3102fc083c925c242d98ac27b1c5f6 ("switch fput to task_work_add
> ") since it seems quite an overhead at task exit with 10^6 of files to
> close.
>
> I understood the 'schedule_work() for interrupt/kernel_thread callers'
> part, but not the task_work_add() one.

If this needs to be kept, maybe then add following, to make sure
we flush the list at most every BITS_PER_LONG files

diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..f3d0a79cef05 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/workqueue.h>
+#include <linux/task_work.h>
 
 int sysctl_nr_open __read_mostly = 1024*1024;
 int sysctl_nr_open_min = BITS_PER_LONG;
@@ -392,6 +393,7 @@ static struct fdtable *close_files(struct files_struct * files)
 				i++;
 				set >>= 1;
 			}
+			task_work_run();
 		}
 	}
 	return fdt;

^ permalink raw reply related	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 14:11 ` Eric Dumazet
@ 2015-08-29 17:08   ` Linus Torvalds
  2015-08-31  5:22     ` yalin wang
                        ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Linus Torvalds @ 2015-08-29 17:08 UTC (permalink / raw)
To: Eric Dumazet
Cc: Oleg Nesterov, Al Viro, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> If this needs to be kept, maybe then add following, to make sure
> we flush the list at most every BITS_PER_LONG files

Hmm.

I'm wondering if we should just make close_files() (or maybe even
filp_close()) use a synchronous fput().

Iirc, the reason we delay fput() is that we had some nasty issues for
the generic fput case. It was called from interrupt context by the aio
code, and just in general there are a lot of nasty cases that can cause
the final fput to happen (so there are lockdep issues with the mmap
locks because the last fput can come from munmap, etc).

Maybe I forget some detail - it's been several years by now - but I
think we could make the regular "close()" and "exit()" cases just use
the synchronous fput (it's called "__fput_sync()" and currently
explicitly limited to just kernel threads).

Al?

Because it feels all kinds of stupid to add things to the task-work
queue just to then remove it almost immediately again. And
close_files() is also called from various contexts. But the whole "put
the final 'files_struct'" case is certainly not at all as special as
the 'put the final file'.

              Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 17:08 ` Linus Torvalds
@ 2015-08-31  5:22   ` yalin wang
  2015-09-05  5:19     ` Al Viro
  1 sibling, 1 reply; 20+ messages in thread
From: yalin wang @ 2015-08-31 5:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Oleg Nesterov, Al Viro, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

> On Aug 30, 2015, at 01:08, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>> If this needs to be kept, maybe then add following, to make sure
>> we flush the list at most every BITS_PER_LONG files
>
> Hmm.
>
> I'm wondering if we should just make close_files() (or maybe even
> filp_close()) use a synchronous fput().
>
> Iirc, the reason we delay fput() is that we had some nasty issues for
> the generic fput case. It was called from interrupt context by the aio
> code, and just in general there's a lot of nasty cases that can cause
> the final fput to happen (so there are lockdep issues with the mmap
> locks because the last fput being from munmap etc).
>
> Maybe I forget some detail - it's been several years by now - but I
> think we could make the regular "close()" and "exit()" cases just use
> the synchronous fput (it's called "__fput_sync()" and currently
> explicitly limited to just kernel threads).
>
> Al?
>
> Because it feels all kinds of stupid to add things to the task-work
> queue just to then remove it almost immediately again. And
> close_files() is also called from various contexts. but the whole "put
> the final 'files_struct' case is certainly not at all as special as
> the 'put the final file'.
>
> Linus

Why not provide an API like:

	fput()
	fput_nosync()

The synchronous version is reasonable and safe most of the time, so
letting the caller select which version to use seems more flexible, no
matter whether it is a kthread or not.

Thanks

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-31  5:22 ` yalin wang
@ 2015-09-05  5:19   ` Al Viro
  0 siblings, 0 replies; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:19 UTC (permalink / raw)
To: yalin wang
Cc: Linus Torvalds, Eric Dumazet, Oleg Nesterov, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Mon, Aug 31, 2015 at 01:22:26PM +0800, yalin wang wrote:

> why not provide API like:
> fput()
> fput_nosync() ?
>
> because synchronous version are reasonable and safe in most time,
> let the user to select which version to use is more feasible, no matter if it is kthread or not.

Synchronous version is *NOT* safe in a lot of situations, from "deep
enough in kernel stack" to "now a function seven levels out in call
chain happens to hold a mutex grabbed elsewhere inside a mutex taken
by unexpected ->release() instance, causing a deadlock", etc.

It's not sync vs. async; we still guarantee execution before return
from syscall. The only case when we really get async is kernel
threads - there we do *not* return to userland at all, so we have to
schedule it really asynchronous. Which is why we need an explicit
sync version (for kernel threads only, not exported, don't use unless
you really understand what you are doing and can explain why that
particular case is safe, etc.)

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 17:08 ` Linus Torvalds
  2015-08-31  5:22 ` yalin wang
@ 2015-08-31 12:44 ` Oleg Nesterov
  2015-09-05  5:12 ` Al Viro
  2 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-31 12:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Al Viro, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On 08/29, Linus Torvalds wrote:
>
> On Sat, Aug 29, 2015 at 7:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > If this needs to be kept, maybe then add following, to make sure
> > we flush the list at most every BITS_PER_LONG files
>
> Hmm.
>
> I'm wondering if we should just make close_files() (or maybe even
> filp_close()) use a synchronous fput().

Heh. I thought about the same change. So perhaps it is even the right
thing to do.

Still I am worried, because "it can't be that simple" ;)

And, with this change close_files() is called before exit_fs() and
exit_task_namespaces(). This is fine (iiuc), but this means that the
creative code in drivers/ can (wrongly) rely on this fact again.

IIRC, the change which moved __fput() into task_work_exit() uncovered
some interesting problems, like filp_open() called from
fop->release().

Anyway, this is the question to Al, I guess.

Oleg.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 17:08 ` Linus Torvalds
  2015-08-31  5:22 ` yalin wang
  2015-08-31 12:44 ` Oleg Nesterov
@ 2015-09-05  5:12 ` Al Viro
  2015-09-05  5:42   ` Al Viro
  2 siblings, 1 reply; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Oleg Nesterov, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, Aug 29, 2015 at 10:08:30AM -0700, Linus Torvalds wrote:

> Hmm.
>
> I'm wondering if we should just make close_files() (or maybe even
> filp_close()) use a synchronous fput().
>
> Iirc, the reason we delay fput() is that we had some nasty issues for
> the generic fput case. It was called from interrupt context by the aio
> code, and just in general there's a lot of nasty cases that can cause
> the final fput to happen (so there are lockdep issues with the mmap
> locks because the last fput being from munmap etc).
>
> Maybe I forget some detail - it's been several years by now - but I
> think we could make the regular "close()" and "exit()" cases just use
> the synchronous fput (it's called "__fput_sync()" and currently
> explicitly limited to just kernel threads).
>
> Al?

First of all, we'd better not count on e.g. delayed fput() *NOT* doing
task_work_add() - we still need to check if any new work had been added.
After all, final close() might very well have done a final mntput()
on a lazy-unmounted filesystem, possibly leaving us with fs shutdown via
task_work_add(). And if that sucker e.g. closes a socket, well, we are
back to closing an opened struct file, with task_work_add() etc.

I'm a bit nervous about filp_close() (that sucker is exported and widely
abused), but close_files()... sure, shouldn't be a problem. And yes,
we can teach __close_fd() to do the same. I really don't understand
what's the benefit, though - it's about the case when we are closing
the last descriptor for a given opened file, so I would be rather
surprised if the slower path taken on the way out to userland was not
lost in noise...

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-09-05  5:12 ` Al Viro
@ 2015-09-05  5:42   ` Al Viro
  2015-09-05 20:46     ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Oleg Nesterov, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, Sep 05, 2015 at 06:12:34AM +0100, Al Viro wrote:

> First of all, we'd better not count on e.g. delayed fput() *NOT* doing
> task_work_add() - we still need to check if any new work had been added.
> After all, final close() might very well have done a final mntput()
> on a lazy-unmounted filesystem, possibly leaving us with fs shutdown via
> task_work_add(). And if that sucker e.g. closes a socket, well, we are
> back to closing an opened struct file, with task_work_add() etc.
>
> I'm a bit nervious about filp_close() (that sucker is exported and widely
> abused), but close_files()... sure, shouldn't be a problem. And yes,
> we can teach __close_fd() to do the same. I really don't understand what's
> the benefit, though - it's about the case when we are closing the last
> descriptor for given opened file, so I would be rather surprised if slower
> path taken on the way out to userland was not lost in noise...

OK, having found the beginning of the thread, I understand what is being
attempted, but... why the hell bother with FIFO in the first place? AFAICS,
task_work_add() uses in VFS (final fput() and final mntput() alike)
do not care about the FIFO at all.

Sure, some out-of-tree mer^H^Hodule might rely on that. So what?

IMO, unless we have a good in-tree reason for insisting on FIFO, dropping it
is the most obvious solution...

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-09-05  5:42 ` Al Viro
@ 2015-09-05 20:46   ` Linus Torvalds
  0 siblings, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2015-09-05 20:46 UTC (permalink / raw)
To: Al Viro
Cc: Eric Dumazet, Oleg Nesterov, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Fri, Sep 4, 2015 at 10:42 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> OK, having found the beginning of the thread, I understand what is being
> attempted, but... why the hell bother with FIFO in the first place? AFAICS,
> task_work_add() uses in VFS (final fput() and final mntput() alike)
> do not care about the FIFO at all.
>
> Sure, some out-of-tree mer^H^Hodule might rely on that. So what?
>
> IMO, unless we have a good in-tree reason for insisting on FIFO, dropping it
> is the most obvious solution...

I agree. We should just try that.

I'll apply Eric's patch from the beginning of this thread, and let's just
see if anybody ever notices. Removing code and possibly fixing a latency
issue sounds like a win-win.

              Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread
* change filp_close() to use __fput_sync() ? (Was: [PATCH] task_work: remove fifo ordering guarantee)
  2015-08-29 13:57 ` Eric Dumazet
  2015-08-29 14:11 ` Eric Dumazet
@ 2015-08-31 12:05 ` Oleg Nesterov
  1 sibling, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-08-31 12:05 UTC (permalink / raw)
To: Eric Dumazet
Cc: Al Viro, Linus Torvalds, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On 08/29, Eric Dumazet wrote:
>
> On Sat, 2015-08-29 at 14:49 +0200, Oleg Nesterov wrote:
> > On 08/28, Eric Dumazet wrote:
> > >
> > > From: Eric Dumazet <edumazet@google.com>
> > >
> > > In commit f341861fb0b ("task_work: add a scheduling point in
> > > task_work_run()") I fixed a latency problem adding a cond_resched()
> > > call.
> > >
> > > Later, commit ac3d0da8f329 added yet another loop to reverse a list,
> > > bringing back the latency spike :
> > >
> > > I've seen in some cases this loop taking 275 ms, if for example a
> > > process with 2,000,000 files is killed.
> > >
> > > We could add yet another cond_resched() in the reverse loop,
> >
> > Can't we do this?
>
> Well, I stated in the changelog we could. Obviously we can.
>
> Adding 275 ms of pure overhead to perform this list reversal for files
> to be closed is quite unfortunate.

Well, if the first loop takes 275 ms, then probably the next one which
actually does a lot of __fput's takes much, much more time, so perhaps
these 275 ms are not very noticeable. Ignoring the latency problem due
to the lack of cond_resched, I mean.

But of course, this is not good, I agree. Please see below.

> > Fifo just looks more sane to me.
>
> Well, files are closed in a random order. These are the main user of
> this stuff.

This is the most "heavy" user. But task_works is the generic API.

> Now we also could question why we needed commit
> 4a9d4b024a3102fc083c925c242d98ac27b1c5f6 ("switch fput to task_work_add
> ") since it seems quite an overhead at task exit with 10^6 of files to
> close.

How about the patch below? I didn't try to test it yet, but since
filp_close() does ->flush() I think __fput_sync() should be safe here.

Al, what do you think?

Oleg.
---

--- x/fs/file_table.c
+++ x/fs/file_table.c
@@ -292,11 +292,8 @@ void fput(struct file *file)
  */
 void __fput_sync(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
-		struct task_struct *task = current;
-		BUG_ON(!(task->flags & PF_KTHREAD));
+	if (atomic_long_dec_and_test(&file->f_count))
 		__fput(file);
-	}
 }
 
 EXPORT_SYMBOL(fput);
--- x/fs/open.c
+++ x/fs/open.c
@@ -1074,7 +1074,7 @@ int filp_close(struct file *filp, fl_owner_t id)
 		dnotify_flush(filp, id);
 		locks_remove_posix(filp, id);
 	}
-	fput(filp);
+	__fput_sync(filp);
 	return retval;
 }

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH] task_work: remove fifo ordering guarantee
  2015-08-29 12:49 ` Oleg Nesterov
  2015-08-29 13:57 ` Eric Dumazet
@ 2015-09-05  5:35 ` Al Viro
  2015-09-07 12:27   ` [PATCH?] fput: don't abuse task_work_add() too much Oleg Nesterov
  1 sibling, 1 reply; 20+ messages in thread
From: Al Viro @ 2015-09-05 5:35 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Eric Dumazet, Linus Torvalds, linux-kernel@vger.kernel.org,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On Sat, Aug 29, 2015 at 02:49:21PM +0200, Oleg Nesterov wrote:

> Let's look keyctl_session_to_parent(). It does task_work_cancel()
> but only because we can not trust user-space. Otherwise we could
> remove it and just do task_work_add(), but this needs fifo.
>
> Fifo just looks more sane to me.

Not if it costs us. As far as files closing is concerned, the order
really doesn't matter. Ditto for final mntput() uses of that stuff.

What *does* matter is task_work_add() issued by callback not getting
lost. IMO the obvious solution is to lose the reordering...

^ permalink raw reply	[flat|nested] 20+ messages in thread
* [PATCH?] fput: don't abuse task_work_add() too much
  2015-09-05  5:35 ` [PATCH] task_work: remove fifo ordering guarantee Al Viro
@ 2015-09-07 12:27   ` Oleg Nesterov
  2015-09-07 13:49     ` [PATCH? v2] " Oleg Nesterov
  0 siblings, 1 reply; 20+ messages in thread
From: Oleg Nesterov @ 2015-09-07 12:27 UTC (permalink / raw)
To: Al Viro, Linus Torvalds
Cc: Eric Dumazet, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On 09/05, Al Viro wrote:
>
> On Sat, Aug 29, 2015 at 02:49:21PM +0200, Oleg Nesterov wrote:
>
> > Let's look keyctl_session_to_parent(). It does task_work_cancel()
> > but only because we can not trust user-space. Otherwise we could
> > remove it and just do task_work_add(), but this needs fifo.
> >
> > Fifo just looks more sane to me.
>
> Not if it costs us.

OK, Eric reports this trivial lockless loop takes 275ms. Apparently
the list of ____fput() works is huge. Now, how much does it cost to
_create_ this huge list using task_work_add() which does cmpxchg() ?
I guess a bit more. This is what should be fixed if we think this
hurts performance-wise.

And again, ignoring the latency problem due to the lack of
cond_resched, I am wondering if these 275ms are actually noticeable
compared to the time the next loop needs to call all these ____fput's.

> As far as files closing is concerned, the order
> really doesn't matter. Ditto for final mntput() uses of that stuff.

Sure, fput/mntput do not care about the ordering. And more, they do
not even really need task_work_add(). We can just remove it from fput()
and everything will work fine. Correctness-wise, I mean.

And yes, unfortunately we do not have in-kernel users which already
rely on fifo. But task_work_add() is the generic API, losing the
ordering makes it less useful or at least less convenient.

Just for (yes sure, artificial) example. Suppose we want to implement
sys_change_process_flags(int pid, uint set, uint clr). Only current
can change its ->flags, so we can use task_work_add() to do this. But
obviously only if it is fifo.

fput() differs because it does not care which process actually does
__fput(). And thus imo we should not count this user if we need to
decide do we need fifo or not.

> IMO the obvious solution is to lose the reordering...

Oh, I disagree. But I guess I can't convince you/Eric/Linus, so I have
to shut up.

Damn. But I can't relax ;) Al, Linus, could you comment the patch below?

Not for inclusion, lacks the changelog/testing, fput() can be simplified.
But as you can see it is simple. With this patch task_work_add(____fput)
will be called only once by (say) do_exit() path. ->fput_list does not
need any serialization / atomic ops / etc. Probably we also need to move
cond_resched() from task_work_run() to ____fput() after this patch.

Again, it is not that I think this actually makes sense, but since you
dislike these 275ms...

What do you think?

Oleg.
---

diff --git a/fs/file_table.c b/fs/file_table.c
index 294174d..36af701 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -241,7 +241,13 @@ static void delayed_fput(struct work_struct *unused)
 
 static void ____fput(struct callback_head *work)
 {
-	__fput(container_of(work, struct file, f_u.fu_rcuhead));
+	struct task_struct *task = current;
+	do {
+		struct file *file = task->fput_list;
+		task->fput_list = file->f_u.fu_next;
+		__fput(file);
+
+	} while (task->fput_list);
 }
 
 /*
@@ -267,9 +273,19 @@ void fput(struct file *file)
 		struct task_struct *task = current;
 
 		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+			if (task->fput_list) {
+				/* task_work_add() below was already called */
+				file->f_u.fu_next = task->fput_list;
+				task->fput_list = file;
+				return;
+			}
+
 			init_task_work(&file->f_u.fu_rcuhead, ____fput);
-			if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
+			if (!task_work_add(task, &file->f_u.fu_rcuhead, true)) {
+				file->f_u.fu_next = NULL;
+				task->fput_list = file;
 				return;
+			}
 			/*
 			 * After this task has run exit_task_work(),
 			 * task_work_add() will fail. Fall through to delayed
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0774487..73fe16c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -830,6 +830,7 @@ struct file {
 	union {
 		struct llist_node	fu_llist;
 		struct rcu_head		fu_rcuhead;
+		struct file		*fu_next;
 	} f_u;
 	struct path		f_path;
 	struct inode		*f_inode;	/* cached value */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f192cfe..6f704ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1477,6 +1477,7 @@ struct task_struct {
 	struct fs_struct *fs;
 /* open file information */
 	struct files_struct *files;
+	struct file *fput_list;
 /* namespaces */
 	struct nsproxy *nsproxy;
 /* signal handlers */
diff --git a/kernel/fork.c b/kernel/fork.c
index 03c1eaa..77c0a50 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1007,6 +1007,7 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 	struct files_struct *oldf, *newf;
 	int error = 0;
 
+	tsk->fput_list = NULL;
 	/*
 	 * A background process may not have any files ...
 	 */

^ permalink raw reply related	[flat|nested] 20+ messages in thread
* [PATCH? v2] fput: don't abuse task_work_add() too much
  2015-09-07 12:27 ` [PATCH?] fput: don't abuse task_work_add() too much Oleg Nesterov
@ 2015-09-07 13:49   ` Oleg Nesterov
  0 siblings, 0 replies; 20+ messages in thread
From: Oleg Nesterov @ 2015-09-07 13:49 UTC (permalink / raw)
To: Al Viro, Linus Torvalds
Cc: Eric Dumazet, linux-kernel@vger.kernel.org, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Maciej Żenczykowski

On 09/07, Oleg Nesterov wrote:
>
> Oh, I disagree. But I guess I can't convince you/Eric/Linus, so I have
> to shut up.
>
> Damn. But I can't relax ;) Al, Linus, could you comment the patch below?
>
> Not for inclusion, lacks the changelog/testing, fput() can be simplified.
> But as you can see it is simple. With this patch task_work_add(____fput)
> will be called only once by (say) do_exit() path. ->fput_list does not
> need any serialization / atomic ops / etc. Probably we also need to move
> cond_resched() from task_work_run() to ____fput() after this patch.
>
> Again, it is not that I think this actually makes sense, but since you
> dislike these 275ms...
>
> What do you think?

Yes, task_struct->fput_list is ugly. We can avoid it, but then we need
another ->next pointer in struct file. Perhaps we can reuse ->f_version?

This way the change looks really simple and not too bad to me. Although
I am not sure you will agree.

Oleg.
---

diff --git a/fs/file_table.c b/fs/file_table.c
index 294174d..c34b666 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -241,7 +241,15 @@ static void delayed_fput(struct work_struct *unused)
 
 static void ____fput(struct callback_head *work)
 {
-	__fput(container_of(work, struct file, f_u.fu_rcuhead));
+	struct file *file = container_of(work, struct file, f_u.fu_rcuhead);
+	struct file *next;
+
+	do {
+		next = file->f_next_put;
+		__fput(file);
+		file = next;
+
+	} while (file);
 }
 
 /*
@@ -267,9 +275,21 @@ void fput(struct file *file)
 		struct task_struct *task = current;
 
 		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+			struct callback_head *work = READ_ONCE(task->task_works);
+			struct file *prev;
+
+			if (work && work->func == ____fput) {
+				prev = container_of(work, struct file, f_u.fu_rcuhead);
+				file->f_next_put = prev->f_next_put;
+				prev->f_next_put = file;
+				return;
+			}
+
 			init_task_work(&file->f_u.fu_rcuhead, ____fput);
-			if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
+			if (!task_work_add(task, &file->f_u.fu_rcuhead, true)) {
+				file->f_next_put = NULL;
 				return;
+			}
 			/*
 			 * After this task has run exit_task_work(),
 			 * task_work_add() will fail. Fall through to delayed
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0774487..9381527 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -849,7 +849,10 @@ struct file {
 	const struct cred	*f_cred;
 	struct file_ra_state	f_ra;
 
-	u64			f_version;
+	union {
+		u64		f_version;
+		struct file	*f_next_put;
+	};
 #ifdef CONFIG_SECURITY
 	void			*f_security;
 #endif

^ permalink raw reply related	[flat|nested] 20+ messages in thread
end of thread, other threads:[~2015-09-07 13:52 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-29  2:42 [PATCH] task_work: remove fifo ordering guarantee Eric Dumazet
2015-08-29  3:19 ` Linus Torvalds
2015-08-29  9:22   ` Ingo Molnar
2015-08-29 12:54     ` Oleg Nesterov
2015-08-31  6:02       ` Ingo Molnar
2015-08-31 12:51         ` Oleg Nesterov
2015-08-29 12:49 ` Oleg Nesterov
2015-08-29 13:57   ` Eric Dumazet
2015-08-29 14:11     ` Eric Dumazet
2015-08-29 17:08       ` Linus Torvalds
2015-08-31  5:22         ` yalin wang
2015-09-05  5:19           ` Al Viro
2015-08-31 12:44       ` Oleg Nesterov
2015-09-05  5:12         ` Al Viro
2015-09-05  5:42           ` Al Viro
2015-09-05 20:46             ` Linus Torvalds
2015-08-31 12:05   ` change filp_close() to use __fput_sync() ? (Was: [PATCH] task_work: remove fifo ordering guarantee) Oleg Nesterov
2015-09-05  5:35   ` [PATCH] task_work: remove fifo ordering guarantee Al Viro
2015-09-07 12:27     ` [PATCH?] fput: don't abuse task_work_add() too much Oleg Nesterov
2015-09-07 13:49       ` [PATCH? v2] " Oleg Nesterov