* Re: epoll oops.
2013-10-14 20:57 ` Linus Torvalds
@ 2013-10-15 15:48 ` Oleg Nesterov
2013-10-15 19:28 ` Dave Jones
2013-10-16 22:39 ` Eric Wong
2013-10-23 9:08 ` Pekka Enberg
2013-10-23 13:43 ` Peter Hurley
2 siblings, 2 replies; 10+ messages in thread
From: Oleg Nesterov @ 2013-10-15 15:48 UTC (permalink / raw)
To: Linus Torvalds, Dave Jones
Cc: Linux Kernel, Al Viro, Davide Libenzi, Eric Wong, Pekka Enberg,
Peter Hurley
> Hmm? There might be other cases..
Yes.
Dave, perhaps you have vmcore? I have no idea if this is possible or
not, but perhaps you can look at eventpoll_release_file's frame and
print file->f_op ?
On 10/14, Linus Torvalds wrote:
>
> [ Adding Pekka to verify the SLAB_DESTROY_BY_RCU semantics
Just in case, we depend on SLAB_DESTROY_BY_RCU anyway, and ->sighand
in particular. lock_task_sighand() equally depends on it.
> Ok, Oleg, going back to that whole thread, I think that old bug went like this:
Yes, yes, thanks, I do remember what this patch does and why. Just
I forgot everything about eventpoll.c, I tried to read it only once
to fix that bug.
> (b) signalfd is special, and it does a
>
> poll_wait(file, ¤t->sighand->signalfd_wqh);
>
> which means that the wait-queue isn't associated with the file
> lifetime at all. It cleans it up with signalfd_cleanup() if the signal
> handlers are removed. Normal (non-epoll) handling is safe, because
> "current->sighand" obviously cannot go away as long as the current
> thread (doing the polling) is in its poll/select handling.
Yes. and, just in case, the main problem is that sighand has no any
connection with the file. Unlike, say, tty which uses ->private_data.
> (c) as a result, epoll and exit() can race, since the normal epoll
> cleanup() is serialized by the file being closed, and we're missing
> that for the case of sighand going away.
Yes. Before that 971316f0503a hack epoll can't even know if the task
which did signalfd_poll() exits and frees the active signalfd_wqh.
If for example that task forked a child before exit.
And the whole RCU logic is only needed if exit/ep_remove_wait_queue
actually race with each other.
> Agreed so far? Ugly, ugly, ugly,
Yes, ugly, agreed. d80e731ecab4 even tries to docunent that this all
is the hack.
> And it looks like it should work.
Yes... I tried to read this all again, and so far I do not see
anything wrong... signalfd_cleanup()->waitqueue_active() looks fine
too, afaics. We do not need to clear ->whead unconditionally, the
only caller of ep_ptable_queue_proc() is signalfd_poll(), and we are
the last thread which can use this ->sighand.
> Peter? Does a tty hangup end up actually possibly freeing the tty
> struct? Looking at it, I'm starting to think that it only affects
> f_op, and the "struct tty" stays around, in which case this is all
> fine.
Of course I can't answer, but at first glance file_tty() should not go
away in this case... If nothing else, tty_release() expects tty != NULL,
and it seems that priv->tty is never changed.
Oleg.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: epoll oops.
2013-10-15 15:48 ` Oleg Nesterov
@ 2013-10-15 19:28 ` Dave Jones
2013-10-16 22:39 ` Eric Wong
1 sibling, 0 replies; 10+ messages in thread
From: Dave Jones @ 2013-10-15 19:28 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Linux Kernel, Al Viro, Davide Libenzi, Eric Wong,
Pekka Enberg, Peter Hurley
On Tue, Oct 15, 2013 at 05:48:38PM +0200, Oleg Nesterov wrote:
> > Hmm? There might be other cases..
>
> Yes.
>
> Dave, perhaps you have vmcore?
afraid not. I probably should set up kdump on my test machines. I'll look into it.
thanks,
Dave
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: epoll oops.
2013-10-15 15:48 ` Oleg Nesterov
2013-10-15 19:28 ` Dave Jones
@ 2013-10-16 22:39 ` Eric Wong
2013-10-17 13:53 ` Oleg Nesterov
1 sibling, 1 reply; 10+ messages in thread
From: Eric Wong @ 2013-10-16 22:39 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Dave Jones, Linux Kernel, Al Viro, Davide Libenzi,
Pekka Enberg, Peter Hurley, Eric Dumazet
Oleg Nesterov <oleg@redhat.com> wrote:
> Yes. Before that 971316f0503a hack epoll can't even know if the task
> which did signalfd_poll() exits and frees the active signalfd_wqh.
> If for example that task forked a child before exit.
>
> And the whole RCU logic is only needed if exit/ep_remove_wait_queue
> actually race with each other.
Is there any chance this oops is caused by (or at least more easily
exposed by) commit 91cf5ab60ff82ecf4550a596867787c1e360dd3f ?
(epoll: add a reschedule point in ep_free())
I thought 91cf5ab would be benign, except...
> Yes, ugly, agreed. d80e731ecab4 even tries to docunent that this all
> is the hack.
.. the following sentence from d80e731ecab4 caught my eye:
It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.
I haven't been able to trace if cond_resched() can take tasklist_lock.
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: epoll oops.
2013-10-16 22:39 ` Eric Wong
@ 2013-10-17 13:53 ` Oleg Nesterov
0 siblings, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2013-10-17 13:53 UTC (permalink / raw)
To: Eric Wong
Cc: Linus Torvalds, Dave Jones, Linux Kernel, Al Viro, Davide Libenzi,
Pekka Enberg, Peter Hurley, Eric Dumazet
On 10/16, Eric Wong wrote:
>
> Oleg Nesterov <oleg@redhat.com> wrote:
> > Yes. Before that 971316f0503a hack epoll can't even know if the task
> > which did signalfd_poll() exits and frees the active signalfd_wqh.
> > If for example that task forked a child before exit.
> >
> > And the whole RCU logic is only needed if exit/ep_remove_wait_queue
> > actually race with each other.
>
> Is there any chance this oops is caused by (or at least more easily
> exposed by) commit 91cf5ab60ff82ecf4550a596867787c1e360dd3f ?
> (epoll: add a reschedule point in ep_free())
>
> I thought 91cf5ab would be benign, except...
>
> > Yes, ugly, agreed. d80e731ecab4 even tries to docunent that this all
> > is the hack.
>
> .. the following sentence from d80e731ecab4 caught my eye:
>
> It also assumes that nobody can take tasklist_lock under epoll
> locks, this seems to be true.
This just reminds that with this patch __wake_up/ep_poll_callback can
be called under write_lock(tasklist).
> I haven't been able to trace if cond_resched() can take tasklist_lock.
No, it can't hold the non-sleepable rwlock_t. And the sentence above
doesn't mean the locks like epmutex, it is mostlt about ep->lock.
Oleg.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: epoll oops.
2013-10-14 20:57 ` Linus Torvalds
2013-10-15 15:48 ` Oleg Nesterov
@ 2013-10-23 9:08 ` Pekka Enberg
2013-10-23 13:43 ` Peter Hurley
2 siblings, 0 replies; 10+ messages in thread
From: Pekka Enberg @ 2013-10-23 9:08 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Jones, Linux Kernel, Al Viro, Davide Libenzi, Eric Wong,
Oleg Nesterov, Peter Hurley
Hi Linus,
On Mon, Oct 14, 2013 at 10:57 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> [ Adding Pekka to verify the SLAB_DESTROY_BY_RCU semantics, and Peter
> Hurley due to the possible tty association ]
>
> On Mon, Oct 14, 2013 at 10:31 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Oleg, does this trigger any memory for you? Commit 971316f0503a
>> ("epoll: ep_unregister_pollwait() can use the freed pwq->whead") just
>> makes me go "Hmm, this is *exactly* that that commit is talking
>> about.."
>
> Ok, Oleg, going back to that whole thread, I think that old bug went like this:
>
> (a) normally all the wait-queues that epoll accesses are associated
> with files, and as such they cannot go away for any normal file
> activity. If the file exists, the waitqueue used for poll() on that
> file must exist.
>
> (b) signalfd is special, and it does a
>
> poll_wait(file, ¤t->sighand->signalfd_wqh);
>
> which means that the wait-queue isn't associated with the file
> lifetime at all. It cleans it up with signalfd_cleanup() if the signal
> handlers are removed. Normal (non-epoll) handling is safe, because
> "current->sighand" obviously cannot go away as long as the current
> thread (doing the polling) is in its poll/select handling.
>
> (c) as a result, epoll and exit() can race, since the normal epoll
> cleanup() is serialized by the file being closed, and we're missing
> that for the case of sighand going away.
>
> (d) we have this magic POLLFREE protocol to make signal handling
> cleanup inform the epoll logic that "oops, this is going away", and we
> depend on the underlying sighand data not going away thanks to the
> eventual destruction of the slab being delayed by RCU.
>
> (e) we are also very careful to only ever initialize the signalfd_wqh
> entry in the SLAB *constructor*, because we cannot do it at every
> allocation: it might still be in reused as long as it exists in the
> slab cache: the SLAB_DESTROY_BY_RCU flag does *not* delay individual
> slab entries, it only delays the final free of the underlying memory
> allocation.
>
> (f) to make things even more exciting, the SLAB_DESTROY_BY_RCU depend
> on the slab implementation: slub and slob seem to delay each
> individual allocation (and do ctor/dtor on every allocation), while
> slab does that "delay only the underlying big page allocator" thing.
So I'm not completely sure what you wanted me to verify Linus but yes
SLAB_DESTROY_BY_RCU only guarantees that the underlying page doesn't
go away for RCU but we're free to reuse the object. Anyone using the
object passed to kmem_cache_free() with SLAB_DESTROY_BY_RCU must check
that it's in fact the object we're interested in.
There's example code in a SLAB_DESTROY_BY_RCU comment in
<linux/slab.h> added by PeterZ.
Pekka
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: epoll oops.
2013-10-14 20:57 ` Linus Torvalds
2013-10-15 15:48 ` Oleg Nesterov
2013-10-23 9:08 ` Pekka Enberg
@ 2013-10-23 13:43 ` Peter Hurley
2 siblings, 0 replies; 10+ messages in thread
From: Peter Hurley @ 2013-10-23 13:43 UTC (permalink / raw)
To: Linus Torvalds, Dave Jones, Al Viro, Oleg Nesterov
Cc: Linux Kernel, Davide Libenzi, Eric Wong, Pekka Enberg
On 10/14/2013 04:57 PM, Linus Torvalds wrote:
> [ Adding Pekka to verify the SLAB_DESTROY_BY_RCU semantics, and Peter
> Hurley due to the possible tty association ]
> And I see a few worrisome cases. For example, look at "tty_poll()". It
> ends up doing something very similar, except it uses the tty instead
> of sighand. And exactly like the sighand struct, the tty allocation
> lifespan can - thanks to hangup() - be shorter than the file
> allocation lifespan.
>
> Peter? Does a tty hangup end up actually possibly freeing the tty
> struct? Looking at it, I'm starting to think that it only affects
> f_op, and the "struct tty" stays around, in which case this is all
> fine.
The tty_struct is only freed at the completion of the tty's
file_operations .release method (tty_release()). Further, it should
not be possible to advance past the tty_ldisc_release() call in
tty_release() while file operations such as tty_poll() -> poll_wait()
or a tty hangup are in-progress.
[Notwithstanding the above, if some kernel driver failed to acquire
a tty reference, either directly or via tty_port_tty_hangup(), before
hanging up, then the hangup could be racing with the .release(). But
I don't think that's what's happening here.]
On 10/15/2013 11:48 AM, Oleg Nesterov wrote:>> Hmm? There might be other cases..
>
> Yes.
>
> Dave, perhaps you have vmcore? I have no idea if this is possible or
> not, but perhaps you can look at eventpoll_release_file's frame and
> print file->f_op ?
I think Oleg's suggestion is the next diagnostic step.
Regards,
Peter Hurley
^ permalink raw reply [flat|nested] 10+ messages in thread