* process 'stuck' at exit. @ 2013-12-10 15:47 Dave Jones 2013-12-10 18:40 ` Linus Torvalds 0 siblings, 1 reply; 114+ messages in thread From: Dave Jones @ 2013-12-10 15:47 UTC (permalink / raw) To: Linus Torvalds; +Cc: Linux Kernel I woke up to find my fuzzer in a curious state. 1121 pts/5 SN+ 0:00 | \_ ../trinity -q -l off -N 999999 -C 42 1130 pts/5 SN+ 0:01 | \_ ../trinity -q -l off -N 999999 -C 42 1131 pts/5 SN+ 0:17 | \_ ../trinity -q -l off -N 999999 -C 42 10818 ? RNs 21115152:53 | \_ ../trinity -q -l off -N 999999 -C 42 (ignore the first 3 pids, they're waiting on 10818, which is the interesting one) It's completed its run of 999999 syscalls, and looking at tmux, it tried to exit. /proc/10818/stack just shows [<ffffffffffffffff>] 0xffffffffffffffff Top reports a core is spinning in the kernel 100%, so I ran perf top -a and saw.. 8.63% [kernel] [k] trace_hardirqs_off_caller 7.68% [kernel] [k] __lock_acquire 5.35% [kernel] [k] gup_huge_pmd 5.31% [kernel] [k] put_compound_page 4.93% [kernel] [k] gup_pud_range 4.76% [kernel] [k] trace_hardirqs_on_caller 4.58% [kernel] [k] get_user_pages_fast 4.00% [kernel] [k] debug_smp_processor_id 4.00% [kernel] [k] lock_is_held 3.73% [kernel] [k] lock_acquired 3.67% [kernel] [k] lock_release sysrq-t shows.. trinity-child27 R running task 5520 10818 1131 0x00080004 0000000000000000 ffff8801b0ef4170 000000000000032c ffff8801b609e108 0000000000000000 ffff880160d21c30 ffffffff810ad895 ffffffff817587a0 ffff8801b0ef4170 ffff8801b609e0a8 ffff8801b609e000 ffff880160d21d50 Call Trace: [<ffffffff817587a0>] ? retint_restore_args+0xe/0xe [<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8100b184>] ? native_sched_clock+0x24/0x80 [<ffffffff8109624f>] ? local_clock+0xf/0x50 [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30 [<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190 [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0 [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0 [<ffffffff810a8a2f>] ? up_read+0x1f/0x40 [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0 [<ffffffff8115f76c>] ? put_page+0x3c/0x50 [<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0 [<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0 [<ffffffff810e019e>] ? do_futex+0xae/0xc80 [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30 [<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170 [<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190 [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0 [<ffffffff810e0de1>] ? SyS_futex+0x71/0x150 [<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0 [<ffffffff81760be4>] ? tracesys+0xdd/0xe2 any ideas ? Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 15:47 process 'stuck' at exit Dave Jones @ 2013-12-10 18:40 ` Linus Torvalds 2013-12-10 19:18 ` Thomas Gleixner 2013-12-10 20:57 ` Darren Hart 0 siblings, 2 replies; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 18:40 UTC (permalink / raw) To: Dave Jones, Thomas Gleixner, Darren Hart, Andrea Arcangeli Cc: Linux Kernel Mailing List Hmm. Looks like the futex code is somehow stuck in a loop, calling get_user_pages_fast(). The futex code itself is apparently so low-overhead that it doesn't show up in your 'perf top' report (which is dominated by all the expensive debug things that get_user_pages_fast() etc ends up doing), but that's the only looping I can see. Perhaps the "goto again" case for transparent huge pages in get_futex_key()? Or the "retry[_private]" cases in futex_requeue()? Some error condition that causes us to retry forever, rather than returning the error code? Added a few more people to the cc.. Ideas? Linus On Tue, Dec 10, 2013 at 7:47 AM, Dave Jones <davej@redhat.com> wrote: > I woke up to find my fuzzer in a curious state. > > 1121 pts/5 SN+ 0:00 | \_ ../trinity -q -l off -N 999999 -C 42 > 1130 pts/5 SN+ 0:01 | \_ ../trinity -q -l off -N 999999 -C 42 > 1131 pts/5 SN+ 0:17 | \_ ../trinity -q -l off -N 999999 -C 42 > 10818 ? RNs 21115152:53 | \_ ../trinity -q -l off -N 999999 -C 42 > > (ignore the first 3 pids, they're waiting on 10818, which is the interesting one) > > It's completed its run of 999999 syscalls, and looking at tmux, it tried to exit. > > /proc/10818/stack just shows > > [<ffffffffffffffff>] 0xffffffffffffffff > > Top reports a core is spinning in the kernel 100%, so I ran perf top -a > and saw.. > > 8.63% [kernel] [k] trace_hardirqs_off_caller > 7.68% [kernel] [k] __lock_acquire > 5.35% [kernel] [k] gup_huge_pmd > 5.31% [kernel] [k] put_compound_page > 4.93% [kernel] [k] gup_pud_range > 4.76% [kernel] [k] trace_hardirqs_on_caller > 4.58% [kernel] [k] get_user_pages_fast > 4.00% [kernel] [k] debug_smp_processor_id > 4.00% [kernel] [k] lock_is_held > 3.73% [kernel] [k] lock_acquired > 3.67% [kernel] [k] lock_release > > > sysrq-t shows.. > > trinity-child27 R running task 5520 10818 1131 0x00080004 > 0000000000000000 ffff8801b0ef4170 000000000000032c ffff8801b609e108 > 0000000000000000 ffff880160d21c30 ffffffff810ad895 ffffffff817587a0 > ffff8801b0ef4170 ffff8801b609e0a8 ffff8801b609e000 ffff880160d21d50 > Call Trace: > [<ffffffff817587a0>] ? retint_restore_args+0xe/0xe > [<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f > [<ffffffff8100b184>] ? native_sched_clock+0x24/0x80 > [<ffffffff8109624f>] ? local_clock+0xf/0x50 > [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30 > [<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190 > [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0 > [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0 > [<ffffffff810a8a2f>] ? up_read+0x1f/0x40 > [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0 > [<ffffffff8115f76c>] ? put_page+0x3c/0x50 > [<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0 > [<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0 > [<ffffffff810e019e>] ? do_futex+0xae/0xc80 > [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30 > [<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170 > [<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190 > [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0 > [<ffffffff810e0de1>] ? SyS_futex+0x71/0x150 > [<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0 > [<ffffffff81760be4>] ? tracesys+0xdd/0xe2 > > > any ideas ? > > Dave > ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 18:40 ` Linus Torvalds @ 2013-12-10 19:18 ` Thomas Gleixner 2013-12-10 19:55 ` Linus Torvalds 2013-12-11 1:02 ` Mel Gorman 2013-12-10 20:57 ` Darren Hart 1 sibling, 2 replies; 114+ messages in thread From: Thomas Gleixner @ 2013-12-10 19:18 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, 10 Dec 2013, Linus Torvalds wrote: > Hmm. Looks like the futex code is somehow stuck in a loop, calling > get_user_pages_fast(). > > The futex code itself is apparently so low-overhead that it doesn't > show up in your 'perf top' report (which is dominated by all the > expensive debug things that get_user_pages_fast() etc ends up doing), > but that's the only looping I can see. Perhaps the "goto again" case > for transparent huge pages in get_futex_key()? Or the Cc'ng more folks on that. > "retry[_private]" cases in futex_requeue()? Some error condition that > causes us to retry forever, rather than returning the error code? So this is pretty unlikely. The retry requires: get_futex_value_locked() == EFAULT; Now we drop the hash bucket locks and do: get_user(); And if that get_user() faults again, we bail out. If it succeeds we try again. So between the get_user() and the next get_futex_value_locked() the effect of get_user() must have been undone. I guess this can happen, but a gazillion times in a row? > > Added a few more people to the cc.. Ideas? ... > > Top reports a core is spinning in the kernel 100%, so I ran perf top -a > > and saw.. > > > > 8.63% [kernel] [k] trace_hardirqs_off_caller > > 7.68% [kernel] [k] __lock_acquire > > 5.35% [kernel] [k] gup_huge_pmd > > 5.31% [kernel] [k] put_compound_page > > 4.93% [kernel] [k] gup_pud_range > > 4.76% [kernel] [k] trace_hardirqs_on_caller > > 4.58% [kernel] [k] get_user_pages_fast > > 4.00% [kernel] [k] debug_smp_processor_id > > 4.00% [kernel] [k] lock_is_held > > 3.73% [kernel] [k] lock_acquired > > 3.67% [kernel] [k] lock_release > > > > > > sysrq-t shows.. > > > > trinity-child27 R running task 5520 10818 1131 0x00080004 > > 0000000000000000 ffff8801b0ef4170 000000000000032c ffff8801b609e108 > > 0000000000000000 ffff880160d21c30 ffffffff810ad895 ffffffff817587a0 > > ffff8801b0ef4170 ffff8801b609e0a8 ffff8801b609e000 ffff880160d21d50 > > Call Trace: > > [<ffffffff817587a0>] ? retint_restore_args+0xe/0xe > > [<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f > > [<ffffffff8100b184>] ? native_sched_clock+0x24/0x80 > > [<ffffffff8109624f>] ? local_clock+0xf/0x50 > > [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30 > > [<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190 > > [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0 > > [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0 > > [<ffffffff810a8a2f>] ? up_read+0x1f/0x40 > > [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0 > > [<ffffffff8115f76c>] ? put_page+0x3c/0x50 > > [<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0 > > [<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0 > > [<ffffffff810e019e>] ? do_futex+0xae/0xc80 > > [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30 > > [<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170 > > [<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190 > > [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0 > > [<ffffffff810e0de1>] ? SyS_futex+0x71/0x150 > > [<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0 > > [<ffffffff81760be4>] ? tracesys+0xdd/0xe2 > > > > > > any ideas ? > > > > Dave > > > ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 19:18 ` Thomas Gleixner @ 2013-12-10 19:55 ` Linus Torvalds 2013-12-10 20:27 ` Dave Jones ` (2 more replies) 2013-12-11 1:02 ` Mel Gorman 1 sibling, 3 replies; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 19:55 UTC (permalink / raw) To: Thomas Gleixner Cc: Dave Jones, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Oleg Nesterov On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > So this is pretty unlikely. The retry requires: > > get_futex_value_locked() == EFAULT; > > Now we drop the hash bucket locks and do: > > get_user(); > > And if that get_user() faults again, we bail out. I think you need to look closer. We have at least also that "futex_proxy_trylock_atomic() returns -EAGAIN" case. Which triggers at some exit condition. Another thread in the same group, perhaps never completing the exit because it's waiting for this one? I dunno, I didn't look any closer (but this does make me think "Hey, we should add Oleg to the Cc too", since PF_EXITING is involved).. So maybe there is some situation where that EAGAIN will keep happening, forever.. Now, I'm *not* saying that that is it. It's quite possible/likely some other loop, but I do have to say that it sure isn't _obvious_. And that whole EAGAIN return case is quite deep and special, so ... Linus PS: Oleg - the whole thread is on lkml. Ping me if you need more context. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 19:55 ` Linus Torvalds @ 2013-12-10 20:27 ` Dave Jones 2013-12-10 20:34 ` Thomas Gleixner 2013-12-10 20:33 ` Thomas Gleixner 2013-12-10 20:35 ` Oleg Nesterov 2 siblings, 1 reply; 114+ messages in thread From: Dave Jones @ 2013-12-10 20:27 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Oleg Nesterov On Tue, Dec 10, 2013 at 11:55:06AM -0800, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > So this is pretty unlikely. The retry requires: > > > > get_futex_value_locked() == EFAULT; > > > > Now we drop the hash bucket locks and do: > > > > get_user(); > > > > And if that get_user() faults again, we bail out. > > I think you need to look closer. > > We have at least also that "futex_proxy_trylock_atomic() returns > -EAGAIN" case. Which triggers at some exit condition. Another thread > in the same group, perhaps never completing the exit because it's > waiting for this one? I dunno, I didn't look any closer (but this does > make me think "Hey, we should add Oleg to the Cc too", since > PF_EXITING is involved).. So maybe there is some situation where that > EAGAIN will keep happening, forever.. > > Now, I'm *not* saying that that is it. It's quite possible/likely some > other loop, but I do have to say that it sure isn't _obvious_. And > that whole EAGAIN return case is quite deep and special, so ... > > Linus > > PS: Oleg - the whole thread is on lkml. Ping me if you need more context. btw, I've left the machine in that state, and will for as long as necesary in case someone has any ideas for further tracing experiments. Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 20:27 ` Dave Jones @ 2013-12-10 20:34 ` Thomas Gleixner 2013-12-10 20:55 ` Dave Jones 0 siblings, 1 reply; 114+ messages in thread From: Thomas Gleixner @ 2013-12-10 20:34 UTC (permalink / raw) To: Dave Jones Cc: Linus Torvalds, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Oleg Nesterov On Tue, 10 Dec 2013, Dave Jones wrote: > On Tue, Dec 10, 2013 at 11:55:06AM -0800, Linus Torvalds wrote: > > On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > So this is pretty unlikely. The retry requires: > > > > > > get_futex_value_locked() == EFAULT; > > > > > > Now we drop the hash bucket locks and do: > > > > > > get_user(); > > > > > > And if that get_user() faults again, we bail out. > > > > I think you need to look closer. > > > > We have at least also that "futex_proxy_trylock_atomic() returns > > -EAGAIN" case. Which triggers at some exit condition. Another thread > > in the same group, perhaps never completing the exit because it's > > waiting for this one? I dunno, I didn't look any closer (but this does > > make me think "Hey, we should add Oleg to the Cc too", since > > PF_EXITING is involved).. So maybe there is some situation where that > > EAGAIN will keep happening, forever.. > > > > Now, I'm *not* saying that that is it. It's quite possible/likely some > > other loop, but I do have to say that it sure isn't _obvious_. And > > that whole EAGAIN return case is quite deep and special, so ... > > > > Linus > > > > PS: Oleg - the whole thread is on lkml. Ping me if you need more context. > > btw, I've left the machine in that state, and will for as long as necesary > in case someone has any ideas for further tracing experiments. Can you gather a trace with the function tracer? That will tell us what the thing is actually doing. Thanks, tglx ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 20:34 ` Thomas Gleixner @ 2013-12-10 20:55 ` Dave Jones 2013-12-10 21:25 ` Darren Hart 0 siblings, 1 reply; 114+ messages in thread From: Dave Jones @ 2013-12-10 20:55 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Oleg Nesterov On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote: > > > PS: Oleg - the whole thread is on lkml. Ping me if you need more context. > > > > btw, I've left the machine in that state, and will for as long as necesary > > in case someone has any ideas for further tracing experiments. > > Can you gather a trace with the function tracer? That will tell us > what the thing is actually doing. Feed me command lines, and I'll see what I can do. Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 20:55 ` Dave Jones @ 2013-12-10 21:25 ` Darren Hart 2013-12-10 21:28 ` Thomas Gleixner 0 siblings, 1 reply; 114+ messages in thread From: Darren Hart @ 2013-12-10 21:25 UTC (permalink / raw) To: Dave Jones, Steven Rostedt Cc: Thomas Gleixner, Linus Torvalds, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Oleg Nesterov On Tue, 2013-12-10 at 15:55 -0500, Dave Jones wrote: > On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote: > > > > > PS: Oleg - the whole thread is on lkml. Ping me if you need more context. > > > > > > btw, I've left the machine in that state, and will for as long as necesary > > > in case someone has any ideas for further tracing experiments. > > > > Can you gather a trace with the function tracer? That will tell us > > what the thing is actually doing. > > Feed me command lines, and I'll see what I can do. > > Dave > So let's see if we can determine the call chain inside futex_requeue, if there is one. We want to see if we are calling the following functions: futex.c: free_pi_state futex.c: get_futex_value_locked futex.c: futex_proxy_trylock_atomic futex.c: lookup_pi_state futex.c: fault_in_user_writeable I'm hoping we can make use of ftrace here? Consider: $ trace-cmd record -P <PID> -p function -l '*futex*' And to try outside of futex: $ trace-cmd record -P <PID> -p function Then ^C after a few seconds. The trace will be in trace.dat -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:25 ` Darren Hart @ 2013-12-10 21:28 ` Thomas Gleixner 2013-12-10 21:39 ` Steven Rostedt 0 siblings, 1 reply; 114+ messages in thread From: Thomas Gleixner @ 2013-12-10 21:28 UTC (permalink / raw) To: Darren Hart Cc: Dave Jones, Steven Rostedt, Linus Torvalds, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Oleg Nesterov On Tue, 10 Dec 2013, Darren Hart wrote: > On Tue, 2013-12-10 at 15:55 -0500, Dave Jones wrote: > > On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote: > > > > > > > PS: Oleg - the whole thread is on lkml. Ping me if you need more context. > > > > > > > > btw, I've left the machine in that state, and will for as long as necesary > > > > in case someone has any ideas for further tracing experiments. > > > > > > Can you gather a trace with the function tracer? That will tell us > > > what the thing is actually doing. > > > > Feed me command lines, and I'll see what I can do. > > > > Dave > > > > So let's see if we can determine the call chain inside futex_requeue, if > there is one. We want to see if we are calling the following functions: > > futex.c: free_pi_state > futex.c: get_futex_value_locked > futex.c: futex_proxy_trylock_atomic > futex.c: lookup_pi_state > futex.c: fault_in_user_writeable > > I'm hoping we can make use of ftrace here? > > Consider: > > $ trace-cmd record -P <PID> -p function -l '*futex*' > > And to try outside of futex: > > $ trace-cmd record -P <PID> -p function > > Then ^C after a few seconds. The trace will be in trace.dat We definitely want the latter. The futex code gets compiled into some badly traceable mess. Thanks, tglx ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:28 ` Thomas Gleixner @ 2013-12-10 21:39 ` Steven Rostedt 0 siblings, 0 replies; 114+ messages in thread From: Steven Rostedt @ 2013-12-10 21:39 UTC (permalink / raw) To: Thomas Gleixner Cc: Darren Hart, Dave Jones, Linus Torvalds, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Oleg Nesterov On Tue, 10 Dec 2013 22:28:01 +0100 (CET) Thomas Gleixner <tglx@linutronix.de> wrote: > On Tue, 10 Dec 2013, Darren Hart wrote: > > On Tue, 2013-12-10 at 15:55 -0500, Dave Jones wrote: > > > On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote: > > > > > > > > > PS: Oleg - the whole thread is on lkml. Ping me if you need more context. > > > > > > > > > > btw, I've left the machine in that state, and will for as long as necesary > > > > > in case someone has any ideas for further tracing experiments. > > > > > > > > Can you gather a trace with the function tracer? That will tell us > > > > what the thing is actually doing. > > > > > > Feed me command lines, and I'll see what I can do. > > > > > > Dave > > > > > > > So let's see if we can determine the call chain inside futex_requeue, if > > there is one. We want to see if we are calling the following functions: > > > > futex.c: free_pi_state > > futex.c: get_futex_value_locked > > futex.c: futex_proxy_trylock_atomic > > futex.c: lookup_pi_state > > futex.c: fault_in_user_writeable > > > > I'm hoping we can make use of ftrace here? > > > > Consider: > > > > $ trace-cmd record -P <PID> -p function -l '*futex*' Note the above just picks functions that have "futex" in its name, and will not include all the functions you listed above. > > > > And to try outside of futex: > > > > $ trace-cmd record -P <PID> -p function > > > > Then ^C after a few seconds. The trace will be in trace.dat > > We definitely want the latter. The futex code gets compiled into some > badly traceable mess. Yeah, lets use all functions. If its in a loop, we should see exactly what is happening, with full function tracing. You may be able to simplify it to just: trace-cmd record -p function sleep 10 and then read the trace. Nasty loopers usually dominate these types of traces so no need to filter on them in the recording. -- Steve ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 19:55 ` Linus Torvalds 2013-12-10 20:27 ` Dave Jones @ 2013-12-10 20:33 ` Thomas Gleixner 2013-12-10 20:43 ` Linus Torvalds 2013-12-10 20:35 ` Oleg Nesterov 2 siblings, 1 reply; 114+ messages in thread From: Thomas Gleixner @ 2013-12-10 20:33 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Oleg Nesterov On Tue, 10 Dec 2013, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > So this is pretty unlikely. The retry requires: > > > > get_futex_value_locked() == EFAULT; > > > > Now we drop the hash bucket locks and do: > > > > get_user(); > > > > And if that get_user() faults again, we bail out. > > I think you need to look closer. > > We have at least also that "futex_proxy_trylock_atomic() returns > -EAGAIN" case. Which triggers at some exit condition. Another thread > in the same group, perhaps never completing the exit because it's > waiting for this one? I dunno, I didn't look any closer (but this does > make me think "Hey, we should add Oleg to the Cc too", since > PF_EXITING is involved).. So maybe there is some situation where that > EAGAIN will keep happening, forever.. > > Now, I'm *not* saying that that is it. It's quite possible/likely some > other loop, but I do have to say that it sure isn't _obvious_. And > that whole EAGAIN return case is quite deep and special, so ... The -EAGAIN is when the user value changed, simplified: oldval = *uval; sys_futex(....., oldval) do_futex(...) { if (oldval != *uval) return -EAGAIN; } sys_exit(); And user space loops on that. Thanks, tglx ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 20:33 ` Thomas Gleixner @ 2013-12-10 20:43 ` Linus Torvalds 2013-12-10 21:17 ` Thomas Gleixner 0 siblings, 1 reply; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 20:43 UTC (permalink / raw) To: Thomas Gleixner Cc: Dave Jones, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Oleg Nesterov On Tue, Dec 10, 2013 at 12:33 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > The -EAGAIN is when the user value changed, simplified: No it's not. Thomas, stop this crap already. Look at the f*cking code carefully instead of just dismissing cases. The worrisome EAGAIN case is futex_requeue futex_proxy_trylock_atomic futex_lock_pi_atomic lookup_pi_state: ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN; and now futex_requeue() will do "goto repeat" for that EAGAIN case. So, Christ, Thomas, you have now *twice* dismissed a real concern with totally bogus "that can never happen" by explaining some totally unrelated *simple* case rather than the much more complex case. So please. Really. Truly look at the code and thing about it, or shut the f*ck up. No more of this shit where you glance at the code, find some simple case, and say "that can't happen", and dismiss the bug-report. So far Dave's bug-reports have generally pretty much universally shown real bugs. Being dismissive about it is not helpful, quite the reverse. Maybe the loop I'm pointing at cannot happen, but *your* explanation for why it couldn't happen was pure and utter garbage, and was clearly because you hadn't even bothered to look at all the cases. Linus ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 20:43 ` Linus Torvalds @ 2013-12-10 21:17 ` Thomas Gleixner 0 siblings, 0 replies; 114+ messages in thread From: Thomas Gleixner @ 2013-12-10 21:17 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Oleg Nesterov On Tue, 10 Dec 2013, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 12:33 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > The -EAGAIN is when the user value changed, simplified: > > No it's not. > > Thomas, stop this crap already. Look at the f*cking code carefully > instead of just dismissing cases. > > The worrisome EAGAIN case is > > futex_requeue > futex_proxy_trylock_atomic > futex_lock_pi_atomic > lookup_pi_state: > ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN; > > and now futex_requeue() will do "goto repeat" for that EAGAIN case. > > So, Christ, Thomas, you have now *twice* dismissed a real concern with > totally bogus "that can never happen" by explaining some totally > unrelated *simple* case rather than the much more complex case. > > So please. Really. Truly look at the code and thing about it, or shut > the f*ck up. No more of this shit where you glance at the code, find > some simple case, and say "that can't happen", and dismiss the > bug-report. Well, I spent a fricking long time to work on that code and find the absurdest bugs in it and I'm well aware of the exit issue. > So far Dave's bug-reports have generally pretty much universally shown > real bugs. Being dismissive about it is not helpful, quite the > reverse. I know and I used that information more than once to carefully track down the real reason. I never dismissed a single report. > Maybe the loop I'm pointing at cannot happen, but *your* explanation > for why it couldn't happen was pure and utter garbage, and was clearly > because you hadn't even bothered to look at all the cases. I might have been sloppy and not really explaining why I think, that the requeue_pi exit case is not likely. To make that loop happen it requires the following: 1) An actual user of requeue_pi, which can only be the fuzzer as glibc does not support it. But Daves last fuzzed syscall was something else. 2) The report said, that the last fuzzing test was already done and the fuzzer app is about to exit. That involves futex_requeue from deep inside glibc thread library. And that is NOT using REQUEUE_PI. 3) So now even IF it would use requeue_pi then this would require, that the outer PI lock is already held by some other task which is already in the process of exiting. IOW, this lock would be held by something which did not release it before exit and already set PF_EXITING in exit_signals() and then got stuck for ever before reaching: tsk->flags |= PF_EXITPIDONE; I still think, that it is highly unlikely, but to make sure I already asked Dave before reading your rant to fire up the tracer, so we know for sure where hell the thing is looping. Thanks, tglx ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 19:55 ` Linus Torvalds 2013-12-10 20:27 ` Dave Jones 2013-12-10 20:33 ` Thomas Gleixner @ 2013-12-10 20:35 ` Oleg Nesterov 2013-12-10 20:49 ` Dave Jones 2 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-10 20:35 UTC (permalink / raw) To: Linus Torvalds, Dave Jones Cc: Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman Dave, I must have missed something, help. I am looking at the first message and I can't understand who stuck "at exit". The trace shows that the task with pid=10818 called sys_futex() ? Perhaps "exit" means the userspace paths? Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 20:35 ` Oleg Nesterov @ 2013-12-10 20:49 ` Dave Jones 2013-12-10 21:06 ` Darren Hart 2013-12-10 21:34 ` Oleg Nesterov 0 siblings, 2 replies; 114+ messages in thread From: Dave Jones @ 2013-12-10 20:49 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote: > Dave, I must have missed something, help. > > I am looking at the first message and I can't understand who stuck > "at exit". > > The trace shows that the task with pid=10818 called sys_futex() ? > > Perhaps "exit" means the userspace paths? pid 1131 is wait()'ing for 10818 to exit pid 1130 is periodically sending SIGKILL to 10818 because it's gotten tired of waiting. 10818 is ignoring these because it's stuck in a loop somewhere in the kernel. I tried attaching to 10818 with gdb, and it just hangs. (possibly because its weird stack situation [see 1st post]) by inspecting the shared mapping that all processes have (by gdb'ing 1130) I can see that 10818 did all its full run without incident, and the "exit child" flag in the fuzzer had been in set. The last 'random syscall' the fuzzer did was to sys_accept4, so the futex call must come from somewhere in libc maybe ? Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 20:49 ` Dave Jones @ 2013-12-10 21:06 ` Darren Hart 2013-12-10 21:12 ` Dave Jones 2013-12-10 21:18 ` Linus Torvalds 2013-12-10 21:34 ` Oleg Nesterov 1 sibling, 2 replies; 114+ messages in thread From: Darren Hart @ 2013-12-10 21:06 UTC (permalink / raw) To: Dave Jones Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, 2013-12-10 at 15:49 -0500, Dave Jones wrote: > On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote: > > Dave, I must have missed something, help. > > > > I am looking at the first message and I can't understand who stuck > > "at exit". > > > > The trace shows that the task with pid=10818 called sys_futex() ? > > > > Perhaps "exit" means the userspace paths? > > pid 1131 is wait()'ing for 10818 to exit > > pid 1130 is periodically sending SIGKILL to 10818 because it's gotten > tired of waiting. 10818 is ignoring these because it's stuck in a loop > somewhere in the kernel. > > I tried attaching to 10818 with gdb, and it just hangs. > (possibly because its weird stack situation [see 1st post]) > > by inspecting the shared mapping that all processes have (by gdb'ing 1130) > I can see that 10818 did all its full run without incident, and the > "exit child" flag in the fuzzer had been in set. > > The last 'random syscall' the fuzzer did was to sys_accept4, so the futex call > must come from somewhere in libc maybe ? If that is the case, then Linus' requeue_pi path is highly unlikely as FUTEX_CMP_REQUEUE_PI is not used by glibc (yet). That gives me hope as that way there be dragons. Knowing exactly what syscall was made would be very useful, but I don't know if that information is even available anymore. -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:06 ` Darren Hart @ 2013-12-10 21:12 ` Dave Jones 2013-12-10 21:18 ` Linus Torvalds 1 sibling, 0 replies; 114+ messages in thread From: Dave Jones @ 2013-12-10 21:12 UTC (permalink / raw) To: Darren Hart Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 01:06:23PM -0800, Darren Hart wrote: > On Tue, 2013-12-10 at 15:49 -0500, Dave Jones wrote: > > On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote: > > > Dave, I must have missed something, help. > > > > > > I am looking at the first message and I can't understand who stuck > > > "at exit". > > > > > > The trace shows that the task with pid=10818 called sys_futex() ? > > > > > > Perhaps "exit" means the userspace paths? > > > > pid 1131 is wait()'ing for 10818 to exit > > > > pid 1130 is periodically sending SIGKILL to 10818 because it's gotten > > tired of waiting. 10818 is ignoring these because it's stuck in a loop > > somewhere in the kernel. > > > > I tried attaching to 10818 with gdb, and it just hangs. > > (possibly because its weird stack situation [see 1st post]) > > > > by inspecting the shared mapping that all processes have (by gdb'ing 1130) > > I can see that 10818 did all its full run without incident, and the > > "exit child" flag in the fuzzer had been in set. > > > > The last 'random syscall' the fuzzer did was to sys_accept4, so the futex call > > must come from somewhere in libc maybe ? > > If that is the case, then Linus' requeue_pi path is highly unlikely as > FUTEX_CMP_REQUEUE_PI is not used by glibc (yet). That gives me hope as > that way there be dragons. Knowing exactly what syscall was made would > be very useful, but I don't know if that information is even available > anymore. So that last syscall _that_ 'stuck' thread did was accept4, but I found another pid that had done a futex call just before it exited. (see other mail) What I don't understand is how the running child has futex as part of its stack trace, when the internal log that trinity keeps has on record for that particular pid having called it. Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:06 ` Darren Hart 2013-12-10 21:12 ` Dave Jones @ 2013-12-10 21:18 ` Linus Torvalds 2013-12-10 21:24 ` Linus Torvalds 2013-12-10 21:32 ` Dave Jones 1 sibling, 2 replies; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 21:18 UTC (permalink / raw) To: Darren Hart Cc: Dave Jones, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 1:06 PM, Darren Hart <dvhart@linux.intel.com> wrote: > > . Knowing exactly what syscall was made would > be very useful, but I don't know if that information is even available > anymore. Well, the loop should be visible in the profile, since it's still active. So doing something like perf record -e cycles:pp -g -a sleep 60 to get a good profile for a minute, and then looking at the instruction-level profiles in futex_requeue() should be possible. Of course, inlining etc often makes those rather hard to see, and the bulk of the profile is clearly all about the lock debugging overhead, but if the loop is in futex_requeue() then that *should* be visible in the profile, even if it may not be anywhere near the top.. Dave? Linus ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:18 ` Linus Torvalds @ 2013-12-10 21:24 ` Linus Torvalds 2013-12-10 21:32 ` Dave Jones 1 sibling, 0 replies; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 21:24 UTC (permalink / raw) To: Darren Hart Cc: Dave Jones, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 1:18 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > to get a good profile for a minute, and then looking at the > instruction-level profiles in futex_requeue() should be possible. Ugh. Looking at kernel/futex.s even *without* debugging enabled is pretty messy. Although much of it seems to be because the hash is actually fairly complex. jhash2() really ends up expanding to a lot of instructions. But especially if that kernel was compiled with debug information, then 'perf report' is pretty good about matching it up with the source code, so it might not be *too* hard to try to figure out where the loop is. And if it's not in futex_requeue() (there are loops at other levels, like futex_get_key()), then having the callchain information for the costly operations will hopefully give a hint where the top loop is. So profiling data can be very powerful for things like this. Linus ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:18 ` Linus Torvalds 2013-12-10 21:24 ` Linus Torvalds @ 2013-12-10 21:32 ` Dave Jones 2013-12-10 21:49 ` Linus Torvalds 1 sibling, 1 reply; 114+ messages in thread From: Dave Jones @ 2013-12-10 21:32 UTC (permalink / raw) To: Linus Torvalds Cc: Darren Hart, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 01:18:20PM -0800, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 1:06 PM, Darren Hart <dvhart@linux.intel.com> wrote: > > > > . Knowing exactly what syscall was made would > > be very useful, but I don't know if that information is even available > > anymore. > > Well, the loop should be visible in the profile, since it's still active. > > So doing something like > > perf record -e cycles:pp -g -a sleep 60 > > to get a good profile for a minute, and then looking at the > instruction-level profiles in futex_requeue() should be possible. > > Of course, inlining etc often makes those rather hard to see, and the > bulk of the profile is clearly all about the lock debugging overhead, > but if the loop is in futex_requeue() then that *should* be visible in > the profile, even if it may not be anywhere near the top.. > > Dave? http://www.codemonkey.org.uk/junk/perf.data.xz Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:32 ` Dave Jones @ 2013-12-10 21:49 ` Linus Torvalds 2013-12-10 21:56 ` Dave Jones 2013-12-11 12:45 ` Ingo Molnar 0 siblings, 2 replies; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 21:49 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Darren Hart, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 1:32 PM, Dave Jones <davej@redhat.com> wrote: > > http://www.codemonkey.org.uk/junk/perf.data.xz "Forbidden You don't have permission to access /junk/perf.data.xz on this server." also, we'd need the vmlinux file to actually decode the data, I think. Linus ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:49 ` Linus Torvalds @ 2013-12-10 21:56 ` Dave Jones 2013-12-10 21:59 ` Linus Torvalds 2013-12-11 12:45 ` Ingo Molnar 1 sibling, 1 reply; 114+ messages in thread From: Dave Jones @ 2013-12-10 21:56 UTC (permalink / raw) To: Linus Torvalds Cc: Darren Hart, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 01:49:19PM -0800, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 1:32 PM, Dave Jones <davej@redhat.com> wrote: > > > > http://www.codemonkey.org.uk/junk/perf.data.xz > > "Forbidden > > You don't have permission to access /junk/perf.data.xz on this server." fixed. > also, we'd need the vmlinux file to actually decode the data, I think. Hmm, the only vmlinux I have still around is newer than the running kernel, so that's not going to be much help. Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:56 ` Dave Jones @ 2013-12-10 21:59 ` Linus Torvalds 2013-12-10 22:07 ` Dave Jones 0 siblings, 1 reply; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 21:59 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Darren Hart, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 1:56 PM, Dave Jones <davej@redhat.com> wrote: > > Hmm, the only vmlinux I have still around is newer than the running kernel, > so that's not going to be much help. Ok, /proc/kallsyms would do it, but never mind. I think you already pinpointed where the loop is with the trace file, so no need for trying to decode the profile. Linus ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:59 ` Linus Torvalds @ 2013-12-10 22:07 ` Dave Jones 0 siblings, 0 replies; 114+ messages in thread From: Dave Jones @ 2013-12-10 22:07 UTC (permalink / raw) To: Linus Torvalds Cc: Darren Hart, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 01:59:30PM -0800, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 1:56 PM, Dave Jones <davej@redhat.com> wrote: > > > > Hmm, the only vmlinux I have still around is newer than the running kernel, > > so that's not going to be much help. > > Ok, /proc/kallsyms would do it, but never mind. I think you already > pinpointed where the loop is with the trace file, so no need for > trying to decode the profile. just for completeness in case someone else is following along and prefers to try that: http://www.codemonkey.org.uk/junk/syms Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:49 ` Linus Torvalds 2013-12-10 21:56 ` Dave Jones @ 2013-12-11 12:45 ` Ingo Molnar 1 sibling, 0 replies; 114+ messages in thread From: Ingo Molnar @ 2013-12-11 12:45 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Darren Hart, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Arnaldo Carvalho de Melo * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Dec 10, 2013 at 1:32 PM, Dave Jones <davej@redhat.com> wrote: > > > > http://www.codemonkey.org.uk/junk/perf.data.xz > > "Forbidden > > You don't have permission to access /junk/perf.data.xz on this server." > > also, we'd need the vmlinux file to actually decode the data, I think. So my point is somewhat moot in light of a fix patch, but the best way to export all interesting data is via running 'perf archive' after the perf.data got created: comet:~/tip> perf archive Now please run: $ tar xvf perf.data.tar.bz2 -C ~/.debug wherever you need to run 'perf report' on. comet:~/tip> and copying over both the perf.data and the perf.data.tar.bz2. Arnaldo, I think we should make it even easier and more obvious to export/import perf data, via something like: perf clean # cleans out ~/.debug perf record ... perf export # creates perf.data.tar.bz2 which includes perf.data as well, not just .debug and then whoever gets a perf.data.tar.bz2 can do: perf import # does 'tar xjvf perf.data.tar.bz2' perf report # analyze the data as if it was captured on your own box which extracts it into the local directory. This makes the whole thing rather easy and makes the result complete and trustable. What do you think? Thanks, Ingo ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 20:49 ` Dave Jones 2013-12-10 21:06 ` Darren Hart @ 2013-12-10 21:34 ` Oleg Nesterov 2013-12-10 21:41 ` Dave Jones 1 sibling, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-10 21:34 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On 12/10, Dave Jones wrote: > > On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote: > > > > I am looking at the first message and I can't understand who stuck > > "at exit". > > > > The trace shows that the task with pid=10818 called sys_futex() ? > > > > Perhaps "exit" means the userspace paths? > > pid 1131 is wait()'ing for 10818 to exit > > pid 1130 is periodically sending SIGKILL to 10818 because it's gotten > tired of waiting. 10818 is ignoring these because it's stuck in a loop > somewhere in the kernel. OK, thanks. So it doesn't return to user-space. could you do cd /sys/kernel/debug/tracing/ echo 10818 >> set_ftrace_pid echo function_graph >> current_tracer echo 1 >> tracing_on and look into "trace" file to find out how exactly it loops? > I tried attaching to 10818 with gdb, and it just hangs. This is clear, 10818 obviously can't stop. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:34 ` Oleg Nesterov @ 2013-12-10 21:41 ` Dave Jones 2013-12-10 21:57 ` Linus Torvalds 2013-12-10 22:09 ` Steven Rostedt 0 siblings, 2 replies; 114+ messages in thread From: Dave Jones @ 2013-12-10 21:41 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 10:34:31PM +0100, Oleg Nesterov wrote: > On 12/10, Dave Jones wrote: > > > > On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote: > > > > > > I am looking at the first message and I can't understand who stuck > > > "at exit". > > > > > > The trace shows that the task with pid=10818 called sys_futex() ? > > > > > > Perhaps "exit" means the userspace paths? > > > > pid 1131 is wait()'ing for 10818 to exit > > > > pid 1130 is periodically sending SIGKILL to 10818 because it's gotten > > tired of waiting. 10818 is ignoring these because it's stuck in a loop > > somewhere in the kernel. > > OK, thanks. So it doesn't return to user-space. > > could you do > > cd /sys/kernel/debug/tracing/ > echo 10818 >> set_ftrace_pid > echo function_graph >> current_tracer > echo 1 >> tracing_on > > and look into "trace" file to find out how exactly it loops? http://codemonkey.org.uk/junk/trace Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:41 ` Dave Jones @ 2013-12-10 21:57 ` Linus Torvalds 2013-12-10 22:02 ` Dave Jones ` (3 more replies) 2013-12-10 22:09 ` Steven Rostedt 1 sibling, 4 replies; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 21:57 UTC (permalink / raw) To: Dave Jones, Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 1:41 PM, Dave Jones <davej@redhat.com> wrote: > > http://codemonkey.org.uk/junk/trace Hmm. Ok, so something is calling [__]get_user_pages_fast() and put_page() in a loop, but the trace doesn't show what that "something" is, because it is itself not ever called. However, that pattern does seem to imply that the loop is in get_futex_key(), because all the other loops I see seem to be calling other things as well. And the __get_user_pages_fast() call implies that it's the THP case that triggers the "unlikely(PageTail(page))" case. And anyway, otherwise we'd see lock_page()/unlock_page() too. So it looks like __get_user_pages_fast() fails, and keeps failing. Andrea, this is your code, any ideas? Commit a5b338f2b0b1f ("thp: update futex compound knowledge") to be exact. Linus ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:57 ` Linus Torvalds @ 2013-12-10 22:02 ` Dave Jones 2013-12-10 22:09 ` Oleg Nesterov ` (2 subsequent siblings) 3 siblings, 0 replies; 114+ messages in thread From: Dave Jones @ 2013-12-10 22:02 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 01:57:49PM -0800, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 1:41 PM, Dave Jones <davej@redhat.com> wrote: > > > > http://codemonkey.org.uk/junk/trace > > Hmm. Ok, so something is calling [__]get_user_pages_fast() and > put_page() in a loop, but the trace doesn't show what that "something" > is, because it is itself not ever called. > > However, that pattern does seem to imply that the loop is in > get_futex_key(), because all the other loops I see seem to be calling > other things as well. > > And the __get_user_pages_fast() call implies that it's the THP case > that triggers the "unlikely(PageTail(page))" case. And anyway, > otherwise we'd see lock_page()/unlock_page() too. > > So it looks like __get_user_pages_fast() fails, and keeps failing. > Andrea, this is your code, any ideas? Commit a5b338f2b0b1f ("thp: > update futex compound knowledge") to be exact. So, a reason that this might only be showing up now, is that in the last week I added support to trinity to explicitly do huge page mmaps in the children, whereas before it only ever did that with MAP_SHARED in the main pid, and then every child inherited them on fork(). Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:57 ` Linus Torvalds 2013-12-10 22:02 ` Dave Jones @ 2013-12-10 22:09 ` Oleg Nesterov 2013-12-10 22:19 ` Linus Torvalds 2013-12-12 19:00 ` Andrea Arcangeli 3 siblings, 0 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-10 22:09 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On 12/10, Linus Torvalds wrote: > > So it looks like __get_user_pages_fast() fails, and keeps failing. And "again:" does get_user_pages_fast(). And according to this trace get_user_pages_fast() takes mmap_sem and calls __get_user_pages(). But __get_user_pages() should fail even before follow_page_mask() due to fatal_signal_pending(), in this case get_futex_key() should return the error. is_vm_hugetlb_page(vma). So it seems that this is not THP but VM_HUGETLB and follow_hugetlb_page() succeeds every time? Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:57 ` Linus Torvalds 2013-12-10 22:02 ` Dave Jones 2013-12-10 22:09 ` Oleg Nesterov @ 2013-12-10 22:19 ` Linus Torvalds 2013-12-10 22:33 ` Linus Torvalds 2013-12-10 22:42 ` process 'stuck' at exit Thomas Gleixner 2013-12-12 19:00 ` Andrea Arcangeli 3 siblings, 2 replies; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 22:19 UTC (permalink / raw) To: Dave Jones, Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 664 bytes --] On Tue, Dec 10, 2013 at 1:57 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So it looks like __get_user_pages_fast() fails, and keeps failing. Hmm.. Is any of the addresses unchecked, perhaps? __get_user_pages_fast() does an access_ok() check, while get_user_pages_fast() does *not* seem to do one. That looks a bit dangerous. Yeah, users should have checked the address range, but there really is no reason not to do it in get_user_pages_fast(). And it looks like the futex code is actually seriously buggered. It only does the access_ok() check for the non-shared case. Why? Shouldn't we do something like the attached? Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 1379 bytes --] arch/x86/mm/gup.c | 4 +--- kernel/futex.c | 5 +++-- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index dd74e46828c0..61e726597599 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -326,10 +326,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write, if (end < start) goto slow_irqon; -#ifdef CONFIG_X86_64 - if (end >> __VIRTUAL_MASK_SHIFT) + if (end > TASK_SIZE_MAX) goto slow_irqon; -#endif /* * XXX: batch / limit 'nr', to avoid large irq off latency diff --git a/kernel/futex.c b/kernel/futex.c index 80ba086f021d..23c12a35dca2 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -251,6 +251,9 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw) return -EINVAL; address -= key->both.offset; + if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))) + return -EFAULT; + /* * PROCESS_PRIVATE futexes are fast. * As the mm cannot disappear under us and the 'key' only needs @@ -259,8 +262,6 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw) * but access_ok() should be faster than find_vma() */ if (!fshared) { - if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))) - return -EFAULT; key->private.mm = mm; key->private.address = address; get_futex_key_refs(key); ^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:19 ` Linus Torvalds @ 2013-12-10 22:33 ` Linus Torvalds 2013-12-10 22:38 ` Darren Hart ` (2 more replies) 2013-12-10 22:42 ` process 'stuck' at exit Thomas Gleixner 1 sibling, 3 replies; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 22:33 UTC (permalink / raw) To: Dave Jones, Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman [-- Attachment #1: Type: text/plain, Size: 1138 bytes --] On Tue, Dec 10, 2013 at 2:19 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Shouldn't we do something like the attached? So I think that kernel/futex.c part of the patch might be a good idea, but on x86-64 (which is what Dave is running), the if (end >> __VIRTUAL_MASK_SHIFT) test in get_user_pages_fast() should have been equivalent. And even on 32-bit, we do check the _PAGE_USER bits in the page tables, so I guess it's all good on a get_user_pages_fast() side. So never mind. It's not the address checking. And I think I see what's up. I think what happens is: - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only) - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only). so what triggers this is likely that Dave now does large-pages, and one of them is a read-only mapping. So I would suggest replacing the second "1" in the __get_user_pages_fast() call with a "!ro" instead. So how about this second patch instead (the access_ok() move remains). Comments? Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 1255 bytes --] kernel/futex.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index 80ba086f021d..6272f560385c 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -251,6 +251,9 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw) return -EINVAL; address -= key->both.offset; + if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))) + return -EFAULT; + /* * PROCESS_PRIVATE futexes are fast. * As the mm cannot disappear under us and the 'key' only needs @@ -259,8 +262,6 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw) * but access_ok() should be faster than find_vma() */ if (!fshared) { - if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))) - return -EFAULT; key->private.mm = mm; key->private.address = address; get_futex_key_refs(key); @@ -288,7 +289,7 @@ again: put_page(page); /* serialize against __split_huge_page_splitting() */ local_irq_disable(); - if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) { + if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) { page_head = compound_head(page); /* * page_head is valid pointer but we must pin ^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:33 ` Linus Torvalds @ 2013-12-10 22:38 ` Darren Hart 2013-12-10 22:57 ` Thomas Gleixner 2013-12-11 17:08 ` Oleg Nesterov 2 siblings, 0 replies; 114+ messages in thread From: Darren Hart @ 2013-12-10 22:38 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Oleg Nesterov, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, 2013-12-10 at 14:33 -0800, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 2:19 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Shouldn't we do something like the attached? > > So I think that kernel/futex.c part of the patch might be a good idea, > but on x86-64 (which is what Dave is running), the > > if (end >> __VIRTUAL_MASK_SHIFT) > > test in get_user_pages_fast() should have been equivalent. And even on > 32-bit, we do check the _PAGE_USER bits in the page tables, so I guess > it's all good on a get_user_pages_fast() side. > > So never mind. It's not the address checking. > > And I think I see what's up. > > I think what happens is: > - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only) > - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page > - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only). > > so what triggers this is likely that Dave now does large-pages, and > one of them is a read-only mapping. > > So I would suggest replacing the second "1" in the > __get_user_pages_fast() call with a "!ro" instead. So how about this > second patch instead (the access_ok() move remains). > > Comments? > You're too fast for me, but I'm trying to keep up. It was my understanding that access_ok() was an optimization for private futexes, but something more heavy weight was required for shared. I believe that was find_vma() in the past, but Peter Z changed that with fast gup. Trying to page that all in now and get the exact details.... -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:33 ` Linus Torvalds 2013-12-10 22:38 ` Darren Hart @ 2013-12-10 22:57 ` Thomas Gleixner 2013-12-10 23:05 ` Linus Torvalds 2013-12-11 17:08 ` Oleg Nesterov 2 siblings, 1 reply; 114+ messages in thread From: Thomas Gleixner @ 2013-12-10 22:57 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, 10 Dec 2013, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 2:19 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Shouldn't we do something like the attached? > > So I think that kernel/futex.c part of the patch might be a good idea, > but on x86-64 (which is what Dave is running), the > > if (end >> __VIRTUAL_MASK_SHIFT) > > test in get_user_pages_fast() should have been equivalent. And even on > 32-bit, we do check the _PAGE_USER bits in the page tables, so I guess > it's all good on a get_user_pages_fast() side. > > So never mind. It's not the address checking. > > And I think I see what's up. > > I think what happens is: > - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only) > - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page > - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only). > > so what triggers this is likely that Dave now does large-pages, and > one of them is a read-only mapping. > > So I would suggest replacing the second "1" in the > __get_user_pages_fast() call with a "!ro" instead. So how about this > second patch instead (the access_ok() move remains). Yes, that's what I decoded as well. But how does the access_ok() move do anything helpful here? We really need it for the fastpath !fshared case, but for the fshared case you actively break working code, because you force a VERIFY_WRITE check into it. The VERIFY_WRITE is necessary for !fshared, because there is no way that one thread can map the futex RO and the other RW, right? But for fshared it's legitimate to have a RO mapping if you just wait for the futex. Note, that futexes are (ab)used as user space waitqueues, so RO makes sense. And your move would break them. If [__]get_user_pages_fast() does not do the right checks, then we need to fix that and not add random access_ok() checks into the call sites. Thanks, tglx ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:57 ` Thomas Gleixner @ 2013-12-10 23:05 ` Linus Torvalds 2013-12-10 23:28 ` Thomas Gleixner 2013-12-10 23:31 ` Al Viro 0 siblings, 2 replies; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 23:05 UTC (permalink / raw) To: Thomas Gleixner Cc: Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 2:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > But how does the access_ok() move do anything helpful here? Just making it all more obvious. > We really need it for the fastpath !fshared case, but for the fshared > case you actively break working code, because you force a VERIFY_WRITE > check into it. The VERIFY_WRITE is necessary for !fshared, because > there is no way that one thread can map the futex RO and the other RW, > right? Nobody actually uses that argument any more (it goes back to the old i386 "let's manually verify that we have write permissions, because the CPU doesn't do it for us in the trap handling"), and it should probably be removed. But you're right that it's at least misleading. I'd love to remove it entirely, because it's not even syntax-checked, and it's confusing. But that would be a humongous patch. So these days, "access_ok()" literally just checks that the address is in the user address space range. And that would seem to always be appropriate for futexes, so why not just do it in the generic code? Linus ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 23:05 ` Linus Torvalds @ 2013-12-10 23:28 ` Thomas Gleixner 2013-12-10 23:31 ` Al Viro 1 sibling, 0 replies; 114+ messages in thread From: Thomas Gleixner @ 2013-12-10 23:28 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, 10 Dec 2013, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 2:57 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > But how does the access_ok() move do anything helpful here? > > Just making it all more obvious. > > > We really need it for the fastpath !fshared case, but for the fshared > > case you actively break working code, because you force a VERIFY_WRITE > > check into it. The VERIFY_WRITE is necessary for !fshared, because > > there is no way that one thread can map the futex RO and the other RW, > > right? > > Nobody actually uses that argument any more (it goes back to the old > i386 "let's manually verify that we have write permissions, because > the CPU doesn't do it for us in the trap handling"), and it should > probably be removed. Fair enough. > But you're right that it's at least misleading. I'd love to remove it > entirely, because it's not even syntax-checked, and it's confusing. > But that would be a humongous patch. Well, we should ask Julia for a coccinelle patch to limit the wreckage. :) Seriously, if that VERIFY_WRITE is completely useless we really want to get rid of it. > So these days, "access_ok()" literally just checks that the address is > in the user address space range. And that would seem to always be > appropriate for futexes, so why not just do it in the generic code? Agreed, but as long as the VERIFY_WRITE argument is there it needs at least a big fat comment :) Thanks, tglx ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 23:05 ` Linus Torvalds 2013-12-10 23:28 ` Thomas Gleixner @ 2013-12-10 23:31 ` Al Viro 1 sibling, 0 replies; 114+ messages in thread From: Al Viro @ 2013-12-10 23:31 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 03:05:46PM -0800, Linus Torvalds wrote: > Nobody actually uses that argument any more (it goes back to the old > i386 "let's manually verify that we have write permissions, because > the CPU doesn't do it for us in the trap handling"), and it should > probably be removed. Ah... > But you're right that it's at least misleading. I'd love to remove it > entirely, because it's not even syntax-checked, and it's confusing. > But that would be a humongous patch. > > So these days, "access_ok()" literally just checks that the address is > in the user address space range. And that would seem to always be > appropriate for futexes, so why not just do it in the generic code? So let's turn it into #define access_ok(type, addr, size) address_ok(addr, size) and then users can be converted at leisure. Eventually we'll just remove the unused macros... ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:33 ` Linus Torvalds 2013-12-10 22:38 ` Darren Hart 2013-12-10 22:57 ` Thomas Gleixner @ 2013-12-11 17:08 ` Oleg Nesterov 2013-12-11 17:18 ` Thomas Gleixner 2 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-11 17:08 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On 12/10, Linus Torvalds wrote: > > I think what happens is: > - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only) > - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page > - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only). > > so what triggers this is likely that Dave now does large-pages, and > one of them is a read-only mapping. > > So I would suggest replacing the second "1" in the > __get_user_pages_fast() call with a "!ro" instead. So how about this > second patch instead (the access_ok() move remains). I know almost nothing about THP, but why we may need write == true in this case? IOW, > - if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) { > + if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) { can't if (likely(__get_user_pages_fast(address, 1, 0, &page) == 1)) { work? I have to admit, I do not understand why we can't avoid this altogether. __get_page_tail() can find the stable ->first_page, why get_futex_key() can't ? Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-11 17:08 ` Oleg Nesterov @ 2013-12-11 17:18 ` Thomas Gleixner 2013-12-11 17:56 ` Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Thomas Gleixner @ 2013-12-11 17:18 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Dave Jones, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Wed, 11 Dec 2013, Oleg Nesterov wrote: > On 12/10, Linus Torvalds wrote: > > > > I think what happens is: > > - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only) > > - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page > > - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only). > > > > so what triggers this is likely that Dave now does large-pages, and > > one of them is a read-only mapping. > > > > So I would suggest replacing the second "1" in the > > __get_user_pages_fast() call with a "!ro" instead. So how about this > > second patch instead (the access_ok() move remains). > > I know almost nothing about THP, but why we may need write == true in > this case? > > IOW, > > > - if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) { > > + if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) { > > can't > if (likely(__get_user_pages_fast(address, 1, 0, &page) == 1)) { > > work? If the futex call needs to write to that address, we need a writeable mapping, right? And get_user_pages_fast() will verify that for us. > I have to admit, I do not understand why we can't avoid this altogether. > > __get_page_tail() can find the stable ->first_page, why get_futex_key() > can't ? Because it can be split under us. Thanks, tglx ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-11 17:18 ` Thomas Gleixner @ 2013-12-11 17:56 ` Oleg Nesterov 2013-12-11 19:18 ` PATCH? introduce get_compound_page (Was: process 'stuck' at exit) Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-11 17:56 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Dave Jones, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On 12/11, Thomas Gleixner wrote: > > On Wed, 11 Dec 2013, Oleg Nesterov wrote: > > > > I know almost nothing about THP, but why we may need write == true in > > this case? > > > > IOW, > > > > > - if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) { > > > + if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) { > > > > can't > > if (likely(__get_user_pages_fast(address, 1, 0, &page) == 1)) { > > > > work? > > If the futex call needs to write to that address, we need a writeable > mapping, right? And get_user_pages_fast() will verify that for us. Yes sure. I meant __get_user_pages_fast() which is called to find page_head. > > I have to admit, I do not understand why we can't avoid this altogether. > > > > __get_page_tail() can find the stable ->first_page, why get_futex_key() > > can't ? > > Because it can be split under us. I understand, but why we can't rely on compound_lock() like __get_page_tail() does? OK, could you please explain why something like the pseudo-code below can't work? Oleg. --- x/mm/swap.c +++ x/mm/swap.c @@ -210,7 +210,7 @@ EXPORT_SYMBOL(put_page); * This function is exported but must not be called by anything other * than get_page(). It implements the slow path of get_page(). */ -bool __get_page_tail(struct page *page) +struct page *__get_page_tail(struct page *page) { /* * This takes care of get_page() if run on a tail page @@ -235,7 +235,7 @@ bool __get_page_tail(struct page *page) */ VM_BUG_ON(!PageHead(page_head)); __get_page_tail_foll(page, false); - return true; + return page_head; } else { /* * __split_huge_page_refcount run @@ -247,7 +247,7 @@ bool __get_page_tail(struct page *page) * slab on x86). */ put_page(page_head); - return false; + return NULL; } } @@ -264,10 +264,12 @@ bool __get_page_tail(struct page *page) got = true; } compound_unlock_irqrestore(page_head, flags); - if (unlikely(!got)) + if (likely(got)) + return page_head; + else put_page(page_head); } - return got; + return NULL; } EXPORT_SYMBOL(__get_page_tail); --- x/kernel/futex.c +++ x/kernel/futex.c @@ -282,41 +282,11 @@ again: else err = 0; -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - page_head = page; - if (unlikely(PageTail(page))) { + page_head = __get_page_tail(page); + if (page_head) { put_page(page); - /* serialize against __split_huge_page_splitting() */ - local_irq_disable(); - if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) { - page_head = compound_head(page); - /* - * page_head is valid pointer but we must pin - * it before taking the PG_lock and/or - * PG_compound_lock. The moment we re-enable - * irqs __split_huge_page_splitting() can - * return and the head page can be freed from - * under us. We can't take the PG_lock and/or - * PG_compound_lock on a page that could be - * freed from under us. - */ - if (page != page_head) { - get_page(page_head); - put_page(page); - } - local_irq_enable(); - } else { - local_irq_enable(); - goto again; - } - } -#else - page_head = compound_head(page); - if (page != page_head) { - get_page(page_head); - put_page(page); - } -#endif + else + page_head = page; lock_page(page_head); ^ permalink raw reply [flat|nested] 114+ messages in thread
* PATCH? introduce get_compound_page (Was: process 'stuck' at exit) 2013-12-11 17:56 ` Oleg Nesterov @ 2013-12-11 19:18 ` Oleg Nesterov 2013-12-13 15:10 ` Andrea Arcangeli 0 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-11 19:18 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Dave Jones, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On 12/11, Oleg Nesterov wrote: > > On 12/11, Thomas Gleixner wrote: > > > > On Wed, 11 Dec 2013, Oleg Nesterov wrote: > > > > > > I have to admit, I do not understand why we can't avoid this altogether. > > > > > > __get_page_tail() can find the stable ->first_page, why get_futex_key() > > > can't ? > > > > Because it can be split under us. > > I understand, but why we can't rely on compound_lock() like > __get_page_tail() does? > > OK, could you please explain why something like the pseudo-code > below can't work? Seriously. Can the path below work? With this change get_futex_key() can simply do page_head = get_compound_head(page); if (page_head) put_page(page); else page_head = page; instead of CONFIG_TRANSPARENT_HUGEPAGE/else code. (and we can also remove "bool get_page_head" from __get_page_tail_foll). I am sure I missed something. But I am curious and I'd like to understand what I have missed. Thanks, Oleg. --- a/mm/swap.c +++ b/mm/swap.c @@ -210,7 +210,7 @@ EXPORT_SYMBOL(put_page); * This function is exported but must not be called by anything other * than get_page(). It implements the slow path of get_page(). */ -bool __get_page_tail(struct page *page) +static struct page *__get_compound_head(struct page *page, bool and_tail) { /* * This takes care of get_page() if run on a tail page @@ -234,8 +234,9 @@ bool __get_page_tail(struct page *page) * cannot race here. */ VM_BUG_ON(!PageHead(page_head)); - __get_page_tail_foll(page, false); - return true; + if (and_tail) + get_huge_page_tail(page); + return page_head; } else { /* * __split_huge_page_refcount run @@ -260,17 +261,34 @@ bool __get_page_tail(struct page *page) flags = compound_lock_irqsave(page_head); /* here __split_huge_page_refcount won't run anymore */ if (likely(PageTail(page))) { - __get_page_tail_foll(page, false); + if (and_tail) + get_huge_page_tail(page); got = true; } compound_unlock_irqrestore(page_head, flags); - if (unlikely(!got)) + + if (likely(got)) + return page_head; + else put_page(page_head); } - return got; + return NULL; +} + +bool __get_page_tail(struct page *page) +{ + return !!__get_compound_head(page, true); } EXPORT_SYMBOL(__get_page_tail); +struct page *get_compound_head(struct page *page) +{ + if (PageTail(page)) + return __get_compound_head(page, false); + else + return NULL; +} + /** * put_pages_list() - release a list of pages * @pages: list of pages threaded on page->lru ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit) 2013-12-11 19:18 ` PATCH? introduce get_compound_page (Was: process 'stuck' at exit) Oleg Nesterov @ 2013-12-13 15:10 ` Andrea Arcangeli 2013-12-13 16:22 ` Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Andrea Arcangeli @ 2013-12-13 15:10 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Wed, Dec 11, 2013 at 08:18:55PM +0100, Oleg Nesterov wrote: > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -210,7 +210,7 @@ EXPORT_SYMBOL(put_page); > * This function is exported but must not be called by anything other > * than get_page(). It implements the slow path of get_page(). > */ > -bool __get_page_tail(struct page *page) > +static struct page *__get_compound_head(struct page *page, bool and_tail) > { This is not necessarily inline (unlike __get_page_tail_foll which is inline and optimizes away the build time known "false"/"true" params). > /* > * This takes care of get_page() if run on a tail page > @@ -234,8 +234,9 @@ bool __get_page_tail(struct page *page) > * cannot race here. > */ > VM_BUG_ON(!PageHead(page_head)); > - __get_page_tail_foll(page, false); > - return true; > + if (and_tail) > + get_huge_page_tail(page); So you may be introducing a real branch here unless gcc decides to self-inline the static func. > + return page_head; > } else { > /* > * __split_huge_page_refcount run > @@ -260,17 +261,34 @@ bool __get_page_tail(struct page *page) > flags = compound_lock_irqsave(page_head); > /* here __split_huge_page_refcount won't run anymore */ > if (likely(PageTail(page))) { > - __get_page_tail_foll(page, false); > + if (and_tail) > + get_huge_page_tail(page); Same here. > +bool __get_page_tail(struct page *page) > +{ > + return !!__get_compound_head(page, true); > } I hope gcc would optimize away the !! overhead vs the old got = true/false, once __get_compound_head is inline. So I think you should add inline to __get_compound_head, there's no benefit to keep the get_compound_head in the CPU icache. get_huge_page_tail checks different invariants in the VM_BUG_ON and is only used by gup.c not sure why to call that here. But __get_page_tail has changed in -mm so the above will reject, in -mm you will find __get_page_tail_foll called with "true" parameter too in __get_page_tail for example (not equivalent to get_huge_page_tail even ignoring the invariant checking in VM_BUG_ON) and it defines a fast path for hugetlbfs and slab (faster than the one upstream). But here we have a "tail page" that has reference on both head and tail, and you're taking one more reference on the head, just to release it later by releasing the tail. Your objective is to do: page_head = get_compound_head(page); if (page_head) put_page(page); else page_head = page; If we've to mess with the compound lock for this, considering get_compound_head would have only this user anyway, we could do: page_head = put_compound_tail(page); if (!page_head) page_head = page; Implemented as: put_compound_tail(page) { page_head = compound_trans_head(page); if (!__compound_tail_refcounted(page_head)) { smp_rmb(); if (PageTail(page)) { VM_BUG_ON(!PageHead(page_head)); VM_BUG_ON(atomic_read(&page_head->_count) <= 0); VM_BUG_ON(page_mapcount(page) != 0); return page_head; } else return NULL; } flags = compound_lock_irqsave(page_head); if (!PageTail(page)) { unlock return NULL } VM_BUG_ON(page_head != page->first_page); /* __split_huge_page_refcount will wait now */ VM_BUG_ON(page_mapcount(page) <= 0); atomic_dec(&page->_mapcount); VM_BUG_ON(atomic_read(&page_head->_count) <= 0); VM_BUG_ON(atomic_read(&page->_count) != 0); compound_unlock_irqrestore(page_head, flags); return page_head; } Considering this is going incremental with -mm, and that the -mm previous patches the above would depend on are not upstream yet, I'd suggest to fix the bug in futex with Linus patch now (s/1/!ro/). The strict obviously safe fix Linus posted, is not going to interfere with these optimization later, so there's no downside to apply it now and defer the optimizations to the -mm queue. Thanks! Andrea ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit) 2013-12-13 15:10 ` Andrea Arcangeli @ 2013-12-13 16:22 ` Oleg Nesterov 2013-12-13 17:34 ` Andrea Arcangeli 0 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-13 16:22 UTC (permalink / raw) To: Andrea Arcangeli Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman Andrea. Thanks a lot for the detailed reply. On 12/13, Andrea Arcangeli wrote: > > On Wed, Dec 11, 2013 at 08:18:55PM +0100, Oleg Nesterov wrote: > > get_huge_page_tail checks different invariants in the VM_BUG_ON and is > only used by gup.c not sure why to call that here. Compared to __get_page_tail_foll(get_page_head => F) get_huge_page_tail() lacks the ->first_page->_count, but it is called right after get_page_unless_zero(page_head) so to me get_huge_page_tail() looks a bit better. Personally, I think that even this change --- x/kernel/events/../../mm/internal.h +++ x/kernel/events/../../mm/internal.h @@ -47,11 +47,9 @@ static inline void __get_page_tail_foll( * page_cache_get_speculative()) on tail pages. */ VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0); - VM_BUG_ON(atomic_read(&page->_count) != 0); - VM_BUG_ON(page_mapcount(page) < 0); if (get_page_head) atomic_inc(&page->first_page->_count); - atomic_inc(&page->_mapcount); + get_huge_page_tail(page); } /* makes sense. But this is minor and subjective, I am not going to argue. And I agree with other comments, the patch I sent only tried to explain what I meant. > If we've to mess with the compound lock for this, considering > get_compound_head would have only this user anyway, we could do: > > page_head = put_compound_tail(page); > if (!page_head) > page_head = page; > > Implemented as: OK, this looks better, I agree. I'll try to make v2 based on -mm and your suggestions. > Considering this is going incremental with -mm, and that the -mm > previous patches the above would depend on are not upstream yet, I'd > suggest to fix the bug in futex with Linus patch now (s/1/!ro/). Yes, yes, sure. I didn't mean that this (potential) cleanup can replace the simple fix from Linus. Thanks! Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit) 2013-12-13 16:22 ` Oleg Nesterov @ 2013-12-13 17:34 ` Andrea Arcangeli 2013-12-16 18:36 ` Oleg Nesterov 2013-12-16 20:19 ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov 0 siblings, 2 replies; 114+ messages in thread From: Andrea Arcangeli @ 2013-12-13 17:34 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Fri, Dec 13, 2013 at 05:22:40PM +0100, Oleg Nesterov wrote: > Andrea. Thanks a lot for the detailed reply. > > On 12/13, Andrea Arcangeli wrote: > > > > On Wed, Dec 11, 2013 at 08:18:55PM +0100, Oleg Nesterov wrote: > > > > get_huge_page_tail checks different invariants in the VM_BUG_ON and is > > only used by gup.c not sure why to call that here. > > Compared to __get_page_tail_foll(get_page_head => F) get_huge_page_tail() > lacks the ->first_page->_count, but it is called right after > get_page_unless_zero(page_head) so to me get_huge_page_tail() looks a bit > better. > > Personally, I think that even this change > > --- x/kernel/events/../../mm/internal.h > +++ x/kernel/events/../../mm/internal.h > @@ -47,11 +47,9 @@ static inline void __get_page_tail_foll( > * page_cache_get_speculative()) on tail pages. > */ > VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0); > - VM_BUG_ON(atomic_read(&page->_count) != 0); > - VM_BUG_ON(page_mapcount(page) < 0); > if (get_page_head) > atomic_inc(&page->first_page->_count); > - atomic_inc(&page->_mapcount); > + get_huge_page_tail(page); > } > > /* > > makes sense. But this is minor and subjective, I am not going to argue. The above diff looks a straightforward cleanup you could submit it as a separate patch in a v2 series. This more clearly shows the difference between the two functions, so there is less overlap too. Feel free to change it further if you want, but the current model was to have get_huge_page_tail only for arch/*/mm/gup.c (in mm.h) and __get_page_tail_foll in internal.h for mm/*.c and with the ability to obtain the head page too (needed in some of the mm/*.c cases, and never needed for arch/*/mm/gup.c). > I'll try to make v2 based on -mm and your suggestions. Ok great! Thanks, Andrea ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit) 2013-12-13 17:34 ` Andrea Arcangeli @ 2013-12-16 18:36 ` Oleg Nesterov 2013-12-16 20:19 ` Andrea Arcangeli 2013-12-16 20:19 ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov 1 sibling, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-16 18:36 UTC (permalink / raw) To: Andrea Arcangeli Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On 12/13, Andrea Arcangeli wrote: > > On Fri, Dec 13, 2013 at 05:22:40PM +0100, Oleg Nesterov wrote: > > > > I'll try to make v2 based on -mm and your suggestions. > > Ok great! Yes, it would be great, but I need your help again ;) Let me quote the pseudo-code you sent me: put_compound_tail(page) { page_head = compound_trans_head(page); if (!__compound_tail_refcounted(page_head)) { ... return ...; } flags = compound_lock_irqsave(page_head); ... Sure, put_compound_tail() should be the simplified version of put_compound_page() which doesn't dec page_head->_count, this is clear. But afaics, compound_lock_irqsave() above looks unsafe without get_page_unless_zero(page_head) ? If we race with _split, page_head can be freed and compound_lock() can race with, say, free_pages_check() which plays with page->flags ? So it seems that put_compound_tail() should also do get/put(head) like put_compound_page() does, and this probably means we should factor out the common code somehow. Or I missed something? OTOH, I can't really understand if (likely(page != page_head && get_page_unless_zero(page_head))) in __get_page_tail() and put_compound_page(). First of all, is it really possible that page == compound_trans_head(page)? We already verified that PG_tail was set. Of course this bit can be cleared and ->first_page can be a dangling pointer but it can never be changed to point to this page? (in fact, afaics it can be changed at all as long as we have a reference, but this doesn't matter). And compound_lock_irqsave() looks racy even after get_page_unless_zero(). For example, suppose that page_head was already freed and then re-allocated as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right after prep_new_page() does set_page_refcounted(). Now, can't compound_lock() race with the non-atomic prep_compound_page()->__SetPageHead() ? Finally. basepage_index(page) after put_page(page) in get_futex_key() looks confusing imho. I think this is correct, we already checked PageAnon() so it can't be a thp page. But probably this needs a comment and __basepage_index() should do BUG_ON(!PageHuge()). Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit) 2013-12-16 18:36 ` Oleg Nesterov @ 2013-12-16 20:19 ` Andrea Arcangeli 2013-12-16 20:46 ` Oleg Nesterov 2013-12-19 19:08 ` [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race Oleg Nesterov 0 siblings, 2 replies; 114+ messages in thread From: Andrea Arcangeli @ 2013-12-16 20:19 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote: > On 12/13, Andrea Arcangeli wrote: > > > > On Fri, Dec 13, 2013 at 05:22:40PM +0100, Oleg Nesterov wrote: > > > > > > I'll try to make v2 based on -mm and your suggestions. > > > > Ok great! > > Yes, it would be great, but I need your help again ;) > > > Let me quote the pseudo-code you sent me: > > put_compound_tail(page) { > page_head = compound_trans_head(page); > if (!__compound_tail_refcounted(page_head)) { > ... > return ...; > } > > flags = compound_lock_irqsave(page_head); > ... > > Sure, put_compound_tail() should be the simplified version of > put_compound_page() which doesn't dec page_head->_count, this is clear. > > But afaics, compound_lock_irqsave() above looks unsafe without > get_page_unless_zero(page_head) ? If we race with _split, page_head > can be freed and compound_lock() can race with, say, free_pages_check() > which plays with page->flags ? > > So it seems that put_compound_tail() should also do get/put(head) like > put_compound_page() does, and this probably means we should factor out > the common code somehow. > Yes it was supposed to do the get_page_unless_zero before the compound_lock. > Or I missed something? > > > > OTOH, I can't really understand > > if (likely(page != page_head && get_page_unless_zero(page_head))) > > in __get_page_tail() and put_compound_page(). > > First of all, is it really possible that page == compound_trans_head(page)? It is possible if it become a regular page because split_huge_page run in between. compound_trans_head guarantees us it got us to a safe head page. And the idea was to do get_page_unless_zero only on head pages and never on tails to be safer/simpler. Same goes for the compound_lock that follows still on the page_head where "page != page_head". If you like, you can try to optimize away the check and reuse the PageTail branch inside the compound_lock but I'm not sure if it's worth it. > We already verified that PG_tail was set. Of course this bit can be cleared > and ->first_page can be a dangling pointer but it can never be changed to > point to this page? (in fact, afaics it can be changed at all as long as we > have a reference, but this doesn't matter). If PG_tail gets cleared compound_trans_head returns "page". > And compound_lock_irqsave() looks racy even after get_page_unless_zero(). > > For example, suppose that page_head was already freed and then re-allocated > as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right > after prep_new_page() does set_page_refcounted(). Now, can't compound_lock() > race with the non-atomic prep_compound_page()->__SetPageHead() ? Yes. We need to change to: if (order && (gfp_flags & __GFP_COMP)) prep_compound_page(page, order); smp_wmb(); /* as the compound_lock can be taken after it's refcounted */ set_page_refcounted(page); __SetPageHead uses bts asm insn so literally only a "lock" prefix is missing in a assembly instruction. So the race window is incredibly small, but it must be fixed indeed. This also puts set_page_refcounted as the last action of buffered_rmqueue so there shouldn't be any other issues like this left in the page allocation code. Can you reorder set_page_refcount in your v2? I wonder if arch_alloc_page needs refcount 1, it sets the page as stable on s390. the other way around is to move prep_compound_page before set_page_refcounted (though I think if we can, keeping the refcounted at the very last with a comment is preferable). > Finally. basepage_index(page) after put_page(page) in get_futex_key() looks > confusing imho. I think this is correct, we already checked PageAnon() so it > can't be a thp page. But probably this needs a comment and __basepage_index() > should do BUG_ON(!PageHuge()). This looks a bug if you apply the patches to add THP in pagecache, and BUG_ON(!PageHuge) would also break THP in pagecache later (though safer than silently broken like now). It'd safer to read the basepage_index in a local variable just before doing the put_compund_tail but I agree it's not going to make a difference right now. But the whole idea of working with the head of the THP despite the address points to a tail, looks flakey if there is a split (THP in pagecache). I mean chances are reading basepage_index(page) before releasing the tail pin is not enough. The Anon path looks sane for all cases, as it doesn't pretend to return any information on the actual mapping and it limits itself to do the PageAnon check on the actual head that has PageAnon set, which is still valid thing to do even after a split, and in fact required if a split didn't take place yet as the tail "page" isn't anon but just a tail. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit) 2013-12-16 20:19 ` Andrea Arcangeli @ 2013-12-16 20:46 ` Oleg Nesterov 2013-12-17 16:53 ` Oleg Nesterov 2013-12-19 19:08 ` [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race Oleg Nesterov 1 sibling, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-16 20:46 UTC (permalink / raw) To: Andrea Arcangeli Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On 12/16, Andrea Arcangeli wrote: > > On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote: > > > > So it seems that put_compound_tail() should also do get/put(head) like > > put_compound_page() does, and this probably means we should factor out > > the common code somehow. > > > > Yes it was supposed to do the get_page_unless_zero before the > compound_lock. OK, > > OTOH, I can't really understand > > > > if (likely(page != page_head && get_page_unless_zero(page_head))) > > > > in __get_page_tail() and put_compound_page(). > > > > First of all, is it really possible that page == compound_trans_head(page)? > > ... > > If PG_tail gets cleared compound_trans_head returns "page". Damn, indeed, thanks. > > And compound_lock_irqsave() looks racy even after get_page_unless_zero(). > > > > For example, suppose that page_head was already freed and then re-allocated > > as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right > > after prep_new_page() does set_page_refcounted(). Now, can't compound_lock() > > race with the non-atomic prep_compound_page()->__SetPageHead() ? > > Yes. We need to change to: > > if (order && (gfp_flags & __GFP_COMP)) > prep_compound_page(page, order); > smp_wmb(); > /* as the compound_lock can be taken after it's refcounted */ > set_page_refcounted(page); > > __SetPageHead uses bts asm insn so literally only a "lock" prefix is > missing in a assembly instruction. So the race window is incredibly > small, but it must be fixed indeed. This also puts set_page_refcounted > as the last action of buffered_rmqueue so there shouldn't be any other > issues like this left in the page allocation code. > > Can you reorder set_page_refcount in your v2? OK. I'll try to make something on Wednesday. > > Finally. basepage_index(page) after put_page(page) in get_futex_key() looks > > confusing imho. I think this is correct, we already checked PageAnon() so it > > can't be a thp page. But probably this needs a comment and __basepage_index() > > should do BUG_ON(!PageHuge()). > > This looks a bug if you apply the patches to add THP in pagecache, and > BUG_ON(!PageHuge) would also break THP in pagecache later (though > safer than silently broken like now). > > It'd safer to read the basepage_index in a local variable just before > doing the put_compund_tail but I agree it's not going to make a > difference right now. Yes, so lets discuss (and perhaps change) this later/separately. Andrea, thanks again for your help. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit) 2013-12-16 20:46 ` Oleg Nesterov @ 2013-12-17 16:53 ` Oleg Nesterov 2013-12-17 18:06 ` Andrea Arcangeli 0 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-17 16:53 UTC (permalink / raw) To: Andrea Arcangeli Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On 12/16, Oleg Nesterov wrote: > > On 12/16, Andrea Arcangeli wrote: > > > > On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote: > > > > > > And compound_lock_irqsave() looks racy even after get_page_unless_zero(). > > > > > > For example, suppose that page_head was already freed and then re-allocated > > > as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right > > > after prep_new_page() does set_page_refcounted(). Now, can't compound_lock() > > > race with the non-atomic prep_compound_page()->__SetPageHead() ? > > > > Yes. We need to change to: > > > > if (order && (gfp_flags & __GFP_COMP)) > > prep_compound_page(page, order); > > smp_wmb(); > > /* as the compound_lock can be taken after it's refcounted */ > > set_page_refcounted(page); > > > > __SetPageHead uses bts asm insn so literally only a "lock" prefix is > > missing in a assembly instruction. So the race window is incredibly > > small, but it must be fixed indeed. This also puts set_page_refcounted > > as the last action of buffered_rmqueue so there shouldn't be any other > > issues like this left in the page allocation code. > > > > Can you reorder set_page_refcount in your v2? > > OK. I'll try to make something on Wednesday. Yes, I will, but... I can't stop thinking about another change. What if we simply change __split_huge_page_refcount() to also do compound_lock/unlock(page_tail) in a main loop? This way we can greatly simplify get/put_page paths, we can rely on compound_lock(sub-page) and avoid get_page_unless_zero(page_head). Yes, this will make _split a bit slower, but PG_compound_lock should not be contended? And we should change page_tail->flags carefully, but this looks simple. Or this is not possible/desirable? Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: PATCH? introduce get_compound_page (Was: process 'stuck' at exit) 2013-12-17 16:53 ` Oleg Nesterov @ 2013-12-17 18:06 ` Andrea Arcangeli 2013-12-18 19:19 ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Andrea Arcangeli @ 2013-12-17 18:06 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 17, 2013 at 05:53:35PM +0100, Oleg Nesterov wrote: > On 12/16, Oleg Nesterov wrote: > > > > On 12/16, Andrea Arcangeli wrote: > > > > > > On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote: > > > > > > > > And compound_lock_irqsave() looks racy even after get_page_unless_zero(). > > > > > > > > For example, suppose that page_head was already freed and then re-allocated > > > > as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right > > > > after prep_new_page() does set_page_refcounted(). Now, can't compound_lock() > > > > race with the non-atomic prep_compound_page()->__SetPageHead() ? > > > > > > Yes. We need to change to: > > > > > > if (order && (gfp_flags & __GFP_COMP)) > > > prep_compound_page(page, order); > > > smp_wmb(); > > > /* as the compound_lock can be taken after it's refcounted */ > > > set_page_refcounted(page); > > > > > > __SetPageHead uses bts asm insn so literally only a "lock" prefix is > > > missing in a assembly instruction. So the race window is incredibly > > > small, but it must be fixed indeed. This also puts set_page_refcounted > > > as the last action of buffered_rmqueue so there shouldn't be any other > > > issues like this left in the page allocation code. > > > > > > Can you reorder set_page_refcount in your v2? > > > > OK. I'll try to make something on Wednesday. > > Yes, I will, but... > > I can't stop thinking about another change. What if we simply change > __split_huge_page_refcount() to also do compound_lock/unlock(page_tail) > in a main loop? > > This way we can greatly simplify get/put_page paths, we can rely on > compound_lock(sub-page) and avoid get_page_unless_zero(page_head). > Yes, this will make _split a bit slower, but PG_compound_lock should > not be contended? And we should change page_tail->flags carefully, but > this looks simple. > > Or this is not possible/desirable? That would be 512 nested spinlocks instead of 1, last time I did something like that in mm_take_all_locks people weren't too pleased as it started firing lockdeps complains too. Generally I try to avoid taking too many locks nested if I can. mm_take_all_locks is fine because it only runs when you register an mm into a device driver, so it is a very rare event and not performance critical at all, it is a slow path by all means (only runs when you start a virtual machine or start X with nvidia etc..). So it is not a concern. split_huge_page to the contrary could run in a flood if you're unlucky. split_huge_page is needed not just to handle non-THP aware paths that mostly disappeared by now, but also when you truncate a vma so that a THP doesn't fit in it anymore. So it's up to userland how frequently it needs to run. I think it's reasonable to consider it though, but then it's not guaranteed that a put_page on a THP tail is more frequent than split_huge_page. Keep in mind we do the get_page_unless_zero to stabilize the head to take the compound_lock on it, only for the tails, never for the heads. So this restricts it to _only_ the put_page following a gup_fast. Only gup_fast can ever take a reference on the tail pages of a THP. Nothing else can. I intend to add the foll_flags to gup_fast parameter so we remove FOLL_GET from it in the KVM page fault to avoid doing atomic_inc immediately followed by atomic_dec when establishing sptes. So the only transient mappings that cannot be converted to pin-less mmu_notifier will have to run put_page in gup_fast. In short you're only going to help O_DIRECT with that change, and removing 1 locked op for every 4k written to disk may not offset the cost of 511 locked ops in split_huge_page. Still worth thinking about it, but not obvious win in my view (plus the lockdep trouble with taking too many locks nested). Comments welcome. Thanks! Andrea ^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH -mm 0/7] (Was: introduce get_compound_page) 2013-12-17 18:06 ` Andrea Arcangeli @ 2013-12-18 19:19 ` Oleg Nesterov 2013-12-18 19:19 ` [PATCH -mm 1/7] mm: thp: introduce __put_nontail_page() Oleg Nesterov ` (6 more replies) 0 siblings, 7 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-18 19:19 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On 12/17, Andrea Arcangeli wrote: > > On Tue, Dec 17, 2013 at 05:53:35PM +0100, Oleg Nesterov wrote: > > > > I can't stop thinking about another change. What if we simply change > > __split_huge_page_refcount() to also do compound_lock/unlock(page_tail) > > in a main loop? > > > > This way we can greatly simplify get/put_page paths, we can rely on > > compound_lock(sub-page) and avoid get_page_unless_zero(page_head). > > Yes, this will make _split a bit slower, but PG_compound_lock should > > not be contended? And we should change page_tail->flags carefully, but > > this looks simple. > > > > Or this is not possible/desirable? > > That would be 512 nested spinlocks instead of 1, Yes. But, just in case, we do not need to take them all at once. We only need to take/release the additional lock per page_tail in a loop. > last time I did > something like that in mm_take_all_locks people weren't too pleased as > it started firing lockdeps complains too. bit_spin_lock() doesn't have the lockdep annotations and it would be trivial to add _nested if necessary. > split_huge_page to the contrary could run in a flood if > you're unlucky. split_huge_page is needed not just to handle non-THP > aware paths that mostly disappeared by now, but also when you truncate > a vma so that a THP doesn't fit in it anymore. So it's up to userland > how frequently it needs to run. Yes, I understand. I _think_ this should be fine performance-wise, compared to other things split_huge_page() paths should do these 511 test_and_set_bit + clear_bit should not be noticeable. But of course I can be wrong. > I think it's reasonable to consider it though, but then it's not > guaranteed that a put_page on a THP tail is more frequent than > split_huge_page. Keep in mind we do the get_page_unless_zero to > stabilize the head to take the compound_lock on it, only for the > tails, never for the heads. So this restricts it to _only_ the > put_page following a gup_fast. Only gup_fast can ever take a reference > on the tail pages of a THP. Nothing else can. I see. But my main point was simplification. And I'm afraid this compound_lock() in get/put can race with someone else playing with page->flags "because I own this freshly allocated page". Perhaps. However. I do understand your concerns, lets forget about this for now. --------------------------------------------------------------------------- Please review get_futex_key() changes we discussed before. As for the race with prep_new_page() I'll send another patch, it is completely orthogonal to this series. Testing. I simply have no idea how can I test this series so I didn't even try ;) I'll try to do something tomorrow, but in any case I have to rely on your review. Most of the cleanups in this series are not needed to change get_futex_key(), but imho make sense. And we need more cleanups after this series, I'll do this later if this series makes any sense. Oleg. include/linux/mm.h | 2 + kernel/futex.c | 37 +-------- mm/swap.c | 251 ++++++++++++++++++++++++++-------------------------- 3 files changed, 128 insertions(+), 162 deletions(-) ^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH -mm 1/7] mm: thp: introduce __put_nontail_page() 2013-12-18 19:19 ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov @ 2013-12-18 19:19 ` Oleg Nesterov 2013-12-18 19:19 ` [PATCH -mm 2/7] mm: thp: change __get_page_tail() to use __put_nontail_page() Oleg Nesterov ` (5 subsequent siblings) 6 siblings, 0 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-18 19:19 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Peter Zijlstra, Mel Gorman, linux-kernel Add the new helper, __put_nontail_page(page), which simply does compound/single put depending on PageHead(). put_compound_page() can use it 3 times, probably it can have other users. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/swap.c | 60 ++++++++++++++++++++++++++---------------------------------- 1 files changed, 26 insertions(+), 34 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 0040d9c..f83de1f 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -79,21 +79,26 @@ static void __put_compound_page(struct page *page) (*dtor)(page); } +static void __put_nontail_page(struct page *page) +{ + if (put_page_testzero(page)) { + /* + * By the time all refcounts have been released + * split_huge_page cannot run anymore from under us. + */ + if (PageHead(page)) + __put_compound_page(page); + else + __put_single_page(page); + } +} + static void put_compound_page(struct page *page) { struct page *page_head; if (likely(!PageTail(page))) { - if (put_page_testzero(page)) { - /* - * By the time all refcounts have been released - * split_huge_page cannot run anymore from under us. - */ - if (PageHead(page)) - __put_compound_page(page); - else - __put_single_page(page); - } + __put_nontail_page(page); return; } @@ -176,24 +181,16 @@ static void put_compound_page(struct page *page) if (unlikely(!PageTail(page))) { /* __split_huge_page_refcount run before us */ compound_unlock_irqrestore(page_head, flags); - if (put_page_testzero(page_head)) { - /* - * The head page may have been freed - * and reallocated as a compound page - * of smaller order and then freed - * again. All we know is that it - * cannot have become: a THP page, a - * compound page of higher order, a - * tail page. That is because we - * still hold the refcount of the - * split THP tail and page_head was - * the THP head before the split. - */ - if (PageHead(page_head)) - __put_compound_page(page_head); - else - __put_single_page(page_head); - } + /* + * The head page may have been freed and reallocated + * as a compound page of smaller order and then freed + * again. All we know is that it cannot have become: + * a THP page, a compound page of higher order, a tail + * page. That is because we still hold the refcount of + * the split THP tail and page_head was the THP head + * before the split. + */ + __put_nontail_page(page_head); out_put_single: if (put_page_testzero(page)) __put_single_page(page); @@ -215,12 +212,7 @@ out_put_single: VM_BUG_ON(atomic_read(&page->_count) != 0); compound_unlock_irqrestore(page_head, flags); - if (put_page_testzero(page_head)) { - if (PageHead(page_head)) - __put_compound_page(page_head); - else - __put_single_page(page_head); - } + __put_nontail_page(page_head); } else { /* page_head is a dangling pointer */ VM_BUG_ON(PageTail(page)); -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 114+ messages in thread
* [PATCH -mm 2/7] mm: thp: change __get_page_tail() to use __put_nontail_page() 2013-12-18 19:19 ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov 2013-12-18 19:19 ` [PATCH -mm 1/7] mm: thp: introduce __put_nontail_page() Oleg Nesterov @ 2013-12-18 19:19 ` Oleg Nesterov 2013-12-18 19:19 ` [PATCH -mm 3/7] mm: change release_pages() to use put_page() rather than put_compound_page() Oleg Nesterov ` (4 subsequent siblings) 6 siblings, 0 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-18 19:19 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Peter Zijlstra, Mel Gorman, linux-kernel __get_page_tail() can use __put_nontail_page() instead of put_page() if it races with split_huge_page(). This is what put_compound_page() does, so this is equally safe. And this allows us to factor out this code later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/swap.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index f83de1f..aae24fe 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -289,7 +289,7 @@ bool __get_page_tail(struct page *page) } compound_unlock_irqrestore(page_head, flags); if (unlikely(!got)) - put_page(page_head); + __put_nontail_page(page_head); } return got; } -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 114+ messages in thread
* [PATCH -mm 3/7] mm: change release_pages() to use put_page() rather than put_compound_page() 2013-12-18 19:19 ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov 2013-12-18 19:19 ` [PATCH -mm 1/7] mm: thp: introduce __put_nontail_page() Oleg Nesterov 2013-12-18 19:19 ` [PATCH -mm 2/7] mm: thp: change __get_page_tail() to use __put_nontail_page() Oleg Nesterov @ 2013-12-18 19:19 ` Oleg Nesterov 2013-12-18 19:19 ` [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() Oleg Nesterov ` (3 subsequent siblings) 6 siblings, 0 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-18 19:19 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Peter Zijlstra, Mel Gorman, linux-kernel Change release_pages() to avoid put_compound_page() and simply call put_page(). This adds the additional PageCompound() check into the unlikely path but allows us to change the semantics of put_compound_page(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/swap.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index aae24fe..b0e65b6 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -819,7 +819,7 @@ void release_pages(struct page **pages, int nr, int cold) spin_unlock_irqrestore(&zone->lru_lock, flags); zone = NULL; } - put_compound_page(page); + put_page(page); continue; } -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 114+ messages in thread
* [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() 2013-12-18 19:19 ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov ` (2 preceding siblings ...) 2013-12-18 19:19 ` [PATCH -mm 3/7] mm: change release_pages() to use put_page() rather than put_compound_page() Oleg Nesterov @ 2013-12-18 19:19 ` Oleg Nesterov 2013-12-18 19:36 ` Peter Zijlstra 2013-12-18 19:20 ` [PATCH -mm 5/7] mm: thp: reorganize out_put_single code in __put_page_tail() Oleg Nesterov ` (2 subsequent siblings) 6 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-18 19:19 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Peter Zijlstra, Mel Gorman, linux-kernel put_page(page) calls put_compound_page() if PageCompound(), but put_compound_page() falls back to __put_nontail_page() if the page is non-tail. So we can simply shift the PageTail() logic from put_compound_page() to put_page(), and put_page() can use __put_nontail_page() otherwise. The patch also renames put_compound_page() to __put_page_tail() to reflect the semantics change. And this way put_page() looks more symmetrical with get_page(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/swap.c | 17 ++++++----------- 1 files changed, 6 insertions(+), 11 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index b0e65b6..0400f3b 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -93,15 +93,10 @@ static void __put_nontail_page(struct page *page) } } -static void put_compound_page(struct page *page) +static void __put_page_tail(struct page *page) { struct page *page_head; - if (likely(!PageTail(page))) { - __put_nontail_page(page); - return; - } - /* __split_huge_page_refcount can run under us */ page_head = compound_trans_head(page); @@ -222,10 +217,10 @@ out_put_single: void put_page(struct page *page) { - if (unlikely(PageCompound(page))) - put_compound_page(page); - else if (put_page_testzero(page)) - __put_single_page(page); + if (unlikely(PageTail(page))) + __put_page_tail(page); + else + __put_nontail_page(page); } EXPORT_SYMBOL(put_page); @@ -247,7 +242,7 @@ bool __get_page_tail(struct page *page) bool got; struct page *page_head = compound_trans_head(page); - /* Ref to put_compound_page() comment. */ + /* Ref to __put_page_tail() comment. */ if (!__compound_tail_refcounted(page_head)) { smp_rmb(); if (likely(PageTail(page))) { -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() 2013-12-18 19:19 ` [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() Oleg Nesterov @ 2013-12-18 19:36 ` Peter Zijlstra 2013-12-18 19:50 ` Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Peter Zijlstra @ 2013-12-18 19:36 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Mel Gorman, linux-kernel On Wed, Dec 18, 2013 at 08:19:58PM +0100, Oleg Nesterov wrote: > @@ -247,7 +242,7 @@ bool __get_page_tail(struct page *page) > bool got; > struct page *page_head = compound_trans_head(page); > > - /* Ref to put_compound_page() comment. */ > + /* Ref to __put_page_tail() comment. */ > if (!__compound_tail_refcounted(page_head)) { > smp_rmb(); > if (likely(PageTail(page))) { What code is this against, my local tree doesn't have that smp_rmb(). This suggests its a recent patch; which is good since then we can still drop it and wait for people to send one with a proper comment in. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() 2013-12-18 19:36 ` Peter Zijlstra @ 2013-12-18 19:50 ` Oleg Nesterov 0 siblings, 0 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-18 19:50 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Mel Gorman, linux-kernel On 12/18, Peter Zijlstra wrote: > > On Wed, Dec 18, 2013 at 08:19:58PM +0100, Oleg Nesterov wrote: > > @@ -247,7 +242,7 @@ bool __get_page_tail(struct page *page) > > bool got; > > struct page *page_head = compound_trans_head(page); > > > > - /* Ref to put_compound_page() comment. */ > > + /* Ref to __put_page_tail() comment. */ > > if (!__compound_tail_refcounted(page_head)) { > > smp_rmb(); > > if (likely(PageTail(page))) { > > What code is this against, my local tree doesn't have that smp_rmb(). This is against -mm code. > This suggests its a recent patch; which is good since then we can still > drop it and wait for people to send one with a proper comment in. put_compound_page() explains this barrier, this is what "Ref" above means, I guess. To remind, I think we can do more cleanups here. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH -mm 5/7] mm: thp: reorganize out_put_single code in __put_page_tail() 2013-12-18 19:19 ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov ` (3 preceding siblings ...) 2013-12-18 19:19 ` [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() Oleg Nesterov @ 2013-12-18 19:20 ` Oleg Nesterov 2013-12-18 19:20 ` [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() Oleg Nesterov 2013-12-18 19:20 ` [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it Oleg Nesterov 6 siblings, 0 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-18 19:20 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Peter Zijlstra, Mel Gorman, linux-kernel The patch looks complicated but it is not. It simply moves the out_put_single label at the end of the function and changes the !__compound_tail_refcounted() block to save a tab stop. This cleanups the code layout a bit, and this is the preparation for the next change. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/swap.c | 70 ++++++++++++++++++++++++------------------------------------ 1 files changed, 28 insertions(+), 42 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 0400f3b..5f3dda6 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -125,42 +125,29 @@ static void __put_page_tail(struct page *page) * can be freed and reallocated. */ smp_rmb(); - if (likely(PageTail(page))) { - /* - * __split_huge_page_refcount cannot race - * here. - */ - VM_BUG_ON(!PageHead(page_head)); - VM_BUG_ON(page_mapcount(page) != 0); - if (put_page_testzero(page_head)) { - /* - * If this is the tail of a slab - * compound page, the tail pin must - * not be the last reference held on - * the page, because the PG_slab - * cannot be cleared before all tail - * pins (which skips the _mapcount - * tail refcounting) have been - * released. For hugetlbfs the tail - * pin may be the last reference on - * the page instead, because - * PageHeadHuge will not go away until - * the compound page enters the buddy - * allocator. - */ - VM_BUG_ON(PageSlab(page_head)); - __put_compound_page(page_head); - } - return; - } else + if (unlikely(!PageTail(page))) + goto out_put_single; + + /* + * __split_huge_page_refcount cannot race here. + */ + VM_BUG_ON(!PageHead(page_head)); + VM_BUG_ON(page_mapcount(page) != 0); + if (put_page_testzero(page_head)) { /* - * __split_huge_page_refcount run before us, - * "page" was a THP tail. The split page_head - * has been freed and reallocated as slab or - * hugetlbfs page of smaller order (only - * possible if reallocated as slab on x86). + * If this is the tail of a slab compound page, the + * tail pin must not be the last reference held on + * the page, because the PG_slab cannot be cleared + * before all tail pins (which skips the _mapcount + * tail refcounting) have been released. For hugetlbfs + * the tail pin may be the last reference on the page + * instead, because PageHeadHuge will not go away until + * the compound page enters the buddy allocator. */ - goto out_put_single; + VM_BUG_ON(PageSlab(page_head)); + __put_compound_page(page_head); + } + return; } if (likely(page != page_head && get_page_unless_zero(page_head))) { @@ -186,10 +173,7 @@ static void __put_page_tail(struct page *page) * before the split. */ __put_nontail_page(page_head); -out_put_single: - if (put_page_testzero(page)) - __put_single_page(page); - return; + goto out_put_single; } VM_BUG_ON(page_head != page->first_page); /* @@ -208,11 +192,13 @@ out_put_single: compound_unlock_irqrestore(page_head, flags); __put_nontail_page(page_head); - } else { - /* page_head is a dangling pointer */ - VM_BUG_ON(PageTail(page)); - goto out_put_single; + return; } + +out_put_single: + VM_BUG_ON(PageTail(page)); + if (put_page_testzero(page)) + __put_single_page(page); } void put_page(struct page *page) -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 114+ messages in thread
* [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() 2013-12-18 19:19 ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov ` (4 preceding siblings ...) 2013-12-18 19:20 ` [PATCH -mm 5/7] mm: thp: reorganize out_put_single code in __put_page_tail() Oleg Nesterov @ 2013-12-18 19:20 ` Oleg Nesterov 2013-12-18 21:37 ` Linus Torvalds 2013-12-18 19:20 ` [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it Oleg Nesterov 6 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-18 19:20 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Peter Zijlstra, Mel Gorman, linux-kernel Both __put_page_tail() and __get_page_tail() need to carefully take a reference on page_head, take compound_lock() and recheck PageTail(page) under this lock. This patch extracts this code into the new helper, it will have another user. This also means that __get_page_tail() gets the same VM_BUG_ON() checks. Note: this change can also help if we decide to change the locking, perhaps it makes sense to change __split_huge_page_refcount() to also do compound_lock/unlock(page_tail) in a loop. In this case it would be simple to adapt this helper and its usage. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/swap.c | 99 +++++++++++++++++++++++++++--------------------------------- 1 files changed, 45 insertions(+), 54 deletions(-) diff --git a/mm/swap.c b/mm/swap.c index 5f3dda6..972923d 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -93,12 +93,45 @@ static void __put_nontail_page(struct page *page) } } -static void __put_page_tail(struct page *page) +static bool get_lock_thp_head(struct page *head, struct page *tail, + unsigned long *flags) { - struct page *page_head; + if (unlikely(head == tail || !get_page_unless_zero(head))) + return false; + + /* + * page_head wasn't a dangling pointer but it may not be a head + * page anymore by the time we obtain the lock. That is ok as + * long as it can't be freed from under us. + */ + *flags = compound_lock_irqsave(head); + if (likely(PageTail(tail))) { + VM_BUG_ON(head != tail->first_page); + VM_BUG_ON(atomic_read(&head->_count) <= 1); + VM_BUG_ON(atomic_read(&tail->_count) != 0); + VM_BUG_ON(page_mapcount(tail) <= 1); + return true; + } + + compound_unlock_irqrestore(head, *flags); + /* + * The head page may have been freed and reallocated as a compound + * page of smaller order and then freed again. All we know is that + * it cannot have become: a THP page, a compound page of higher + * order, a tail page. That is because we still hold the refcount + * of the split THP tail and page_head was the THP head before the + * split. + */ + __put_nontail_page(head); + return false; +} + +static void __put_page_tail(struct page *page) +{ /* __split_huge_page_refcount can run under us */ - page_head = compound_trans_head(page); + struct page *page_head = compound_trans_head(page); + unsigned long flags; /* * THP can not break up slab pages so avoid taking @@ -150,45 +183,16 @@ static void __put_page_tail(struct page *page) return; } - if (likely(page != page_head && get_page_unless_zero(page_head))) { - unsigned long flags; - - /* - * page_head wasn't a dangling pointer but it may not - * be a head page anymore by the time we obtain the - * lock. That is ok as long as it can't be freed from - * under us. - */ - flags = compound_lock_irqsave(page_head); - if (unlikely(!PageTail(page))) { - /* __split_huge_page_refcount run before us */ - compound_unlock_irqrestore(page_head, flags); - /* - * The head page may have been freed and reallocated - * as a compound page of smaller order and then freed - * again. All we know is that it cannot have become: - * a THP page, a compound page of higher order, a tail - * page. That is because we still hold the refcount of - * the split THP tail and page_head was the THP head - * before the split. - */ - __put_nontail_page(page_head); - goto out_put_single; - } - VM_BUG_ON(page_head != page->first_page); + if (likely(get_lock_thp_head(page_head, page, &flags))) { /* - * We can release the refcount taken by - * get_page_unless_zero() now that - * __split_huge_page_refcount() is blocked on the + * We can release the refcount taken by get_lock_thp_head() + * now that __split_huge_page_refcount() is blocked on the * compound_lock. */ if (put_page_testzero(page_head)) VM_BUG_ON(1); /* __split_huge_page_refcount will wait now */ - VM_BUG_ON(page_mapcount(page) <= 0); atomic_dec(&page->_mapcount); - VM_BUG_ON(atomic_read(&page_head->_count) <= 0); - VM_BUG_ON(atomic_read(&page->_count) != 0); compound_unlock_irqrestore(page_head, flags); __put_nontail_page(page_head); @@ -224,9 +228,8 @@ bool __get_page_tail(struct page *page) * proper PT lock that already serializes against * split_huge_page(). */ - unsigned long flags; - bool got; struct page *page_head = compound_trans_head(page); + unsigned long flags; /* Ref to __put_page_tail() comment. */ if (!__compound_tail_refcounted(page_head)) { @@ -254,25 +257,13 @@ bool __get_page_tail(struct page *page) } } - got = false; - if (likely(page != page_head && get_page_unless_zero(page_head))) { - /* - * page_head wasn't a dangling pointer but it - * may not be a head page anymore by the time - * we obtain the lock. That is ok as long as it - * can't be freed from under us. - */ - flags = compound_lock_irqsave(page_head); - /* here __split_huge_page_refcount won't run anymore */ - if (likely(PageTail(page))) { - __get_page_tail_foll(page, false); - got = true; - } + if (likely(get_lock_thp_head(page_head, page, &flags))) { + __get_page_tail_foll(page, false); compound_unlock_irqrestore(page_head, flags); - if (unlikely(!got)) - __put_nontail_page(page_head); + return true; } - return got; + + return false; } EXPORT_SYMBOL(__get_page_tail); -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() 2013-12-18 19:20 ` [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() Oleg Nesterov @ 2013-12-18 21:37 ` Linus Torvalds 2013-12-19 16:29 ` Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Linus Torvalds @ 2013-12-18 21:37 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Dave Jones, Darren Hart, Peter Zijlstra, Mel Gorman, Linux Kernel Mailing List On Wed, Dec 18, 2013 at 11:20 AM, Oleg Nesterov <oleg@redhat.com> wrote: > Both __put_page_tail() and __get_page_tail() need to carefully > take a reference on page_head, take compound_lock() and recheck > PageTail(page) under this lock. Btw I suspect this is just disgustingly expensive, and I don't think there's a really good reason for it. May I suggest: - getting rid of the PG_compound_lock bit-lock bitlocks are expensive and unfair, and don't even get lockdep checking - replace it with a (small, say 32-256 entries) array of hashed sequence locks - just hash based on the "struct page" pointer, and teach this code to do a read_seqcount_begin/read_seqcount_retry sequence instead for the page lookup. I think you can get rid of all the irq disables too, and the sequence lock should be pure memory reads for the read-case that we care about. Hmm? This is obviously orthogonal to your series, I just reacted to seeing that bitlock thing that needs atomics for both locking and unlocking and the irq disable, and just generally looks like the worst possible way to do these things. Linus ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() 2013-12-18 21:37 ` Linus Torvalds @ 2013-12-19 16:29 ` Oleg Nesterov 0 siblings, 0 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-19 16:29 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Dave Jones, Darren Hart, Peter Zijlstra, Mel Gorman, Linux Kernel Mailing List On 12/18, Linus Torvalds wrote: > > On Wed, Dec 18, 2013 at 11:20 AM, Oleg Nesterov <oleg@redhat.com> wrote: > > Both __put_page_tail() and __get_page_tail() need to carefully > > take a reference on page_head, take compound_lock() and recheck > > PageTail(page) under this lock. > > Btw I suspect this is just disgustingly expensive, and I don't think > there's a really good reason for it. > > May I suggest: > > - getting rid of the PG_compound_lock bit-lock > > bitlocks are expensive and unfair, and don't even get lockdep checking > > - replace it with a (small, say 32-256 entries) array of hashed sequence locks > > - just hash based on the "struct page" pointer, and teach this code > to do a read_seqcount_begin/read_seqcount_retry sequence instead for > the page lookup. Yes, I thought about this too but didn't dare to suggest. Not sure about seqlock/irqs and other details, this needs more discussion anyway. And of course I am not sure this will be actually better. > This is obviously orthogonal to your series, Yes. But please note that one of the reasons for the new helper is simplify the potential locking changes. The changelog only mentions "even more bitlocks" change, but this doesn't matter. And in fact I think that this allows to do more cleanups even if we do not change the locking, get_lock_thp_head() should return page_head or NULL, tail != head is just another PageTail() check. Perhaps. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it 2013-12-18 19:19 ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov ` (5 preceding siblings ...) 2013-12-18 19:20 ` [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() Oleg Nesterov @ 2013-12-18 19:20 ` Oleg Nesterov 2013-12-18 19:28 ` Peter Zijlstra 6 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-18 19:20 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Peter Zijlstra, Mel Gorman, linux-kernel 1. Add the new helper, compound_head_put_tail(page) which returns the stable compound_head() and drops the reference to this page if it is the sub-page of __compound_tail_refcounted(head). Note: this patch tries to be as simple as possible. But we need to unify the new helper with __put_page_tail() later, or at least factor out the common code. 2. Remove the nasty __get_user_pages_fast() code in get_futex_key(), it can use the new helper to get page_head. Note: with or without this change basepage_index(page) after put_page(page) looks confusing at least, this will be addressed later. Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- include/linux/mm.h | 2 ++ kernel/futex.c | 37 +------------------------------------ mm/swap.c | 35 +++++++++++++++++++++++++++++++++++ 3 files changed, 38 insertions(+), 36 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 13495bd..545df45 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -501,6 +501,8 @@ static inline void __ClearPageBuddy(struct page *page) void put_page(struct page *page); void put_pages_list(struct list_head *pages); +struct page *compound_head_put_tail(struct page *page); + void split_page(struct page *page, unsigned int order); int split_free_page(struct page *page); diff --git a/kernel/futex.c b/kernel/futex.c index c3a1a55..1fd7031 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -282,42 +282,7 @@ again: else err = 0; -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - page_head = page; - if (unlikely(PageTail(page))) { - put_page(page); - /* serialize against __split_huge_page_splitting() */ - local_irq_disable(); - if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) { - page_head = compound_head(page); - /* - * page_head is valid pointer but we must pin - * it before taking the PG_lock and/or - * PG_compound_lock. The moment we re-enable - * irqs __split_huge_page_splitting() can - * return and the head page can be freed from - * under us. We can't take the PG_lock and/or - * PG_compound_lock on a page that could be - * freed from under us. - */ - if (page != page_head) { - get_page(page_head); - put_page(page); - } - local_irq_enable(); - } else { - local_irq_enable(); - goto again; - } - } -#else - page_head = compound_head(page); - if (page != page_head) { - get_page(page_head); - put_page(page); - } -#endif - + page_head = compound_head_put_tail(page); lock_page(page_head); /* diff --git a/mm/swap.c b/mm/swap.c index 972923d..6742c85 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -215,6 +215,41 @@ void put_page(struct page *page) EXPORT_SYMBOL(put_page); /* + * Like compound_head() but also drops the additional reference + * this page can have. Unlike compound_head() it returns the page + * which has a reference, and can't race with split_huge_page(). + */ +struct page *compound_head_put_tail(struct page *page) +{ + struct page *page_head; + unsigned long flags; + + if (!PageTail(page)) + return page; + + page_head = compound_trans_head(page); + + if (!__compound_tail_refcounted(page_head)) { + smp_rmb(); + if (likely(PageTail(page))) + return page_head; + else + return page; + } + + if (likely(get_lock_thp_head(page_head, page, &flags))) { + if (put_page_testzero(page_head)) + VM_BUG_ON(1); + atomic_dec(&page->_mapcount); + compound_unlock_irqrestore(page_head, flags); + + return page_head; + } + + return page; +} + +/* * This function is exported but must not be called by anything other * than get_page(). It implements the slow path of get_page(). */ -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it 2013-12-18 19:20 ` [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it Oleg Nesterov @ 2013-12-18 19:28 ` Peter Zijlstra 2013-12-18 19:40 ` Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Peter Zijlstra @ 2013-12-18 19:28 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Mel Gorman, linux-kernel On Wed, Dec 18, 2013 at 08:20:08PM +0100, Oleg Nesterov wrote: > +struct page *compound_head_put_tail(struct page *page) > +{ > + struct page *page_head; > + unsigned long flags; > + > + if (!PageTail(page)) > + return page; > + > + page_head = compound_trans_head(page); > + > + if (!__compound_tail_refcounted(page_head)) { This barrier is missing a comment describing the ordering and the pairing. > + smp_rmb(); > + if (likely(PageTail(page))) > + return page_head; > + else > + return page; > + } > + > + if (likely(get_lock_thp_head(page_head, page, &flags))) { > + if (put_page_testzero(page_head)) > + VM_BUG_ON(1); > + atomic_dec(&page->_mapcount); > + compound_unlock_irqrestore(page_head, flags); > + > + return page_head; > + } > + > + return page; > +} ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it 2013-12-18 19:28 ` Peter Zijlstra @ 2013-12-18 19:40 ` Oleg Nesterov 0 siblings, 0 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-18 19:40 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrea Arcangeli, Andrew Morton, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Mel Gorman, linux-kernel On 12/18, Peter Zijlstra wrote: > > On Wed, Dec 18, 2013 at 08:20:08PM +0100, Oleg Nesterov wrote: > > +struct page *compound_head_put_tail(struct page *page) > > +{ > > + struct page *page_head; > > + unsigned long flags; > > + > > + if (!PageTail(page)) > > + return page; > > + > > + page_head = compound_trans_head(page); > > + > > + if (!__compound_tail_refcounted(page_head)) { > > This barrier is missing a comment describing the ordering and the > pairing. Yes, and this function lacks other comments it needs, they should be copy-and-past'ed from the very similar __put_page_tail(). But I am going to unify them later if this series passes the review, can't the comments wait till then? OTOH, I won't mind to send v2 with the additional comment about this rmb() at least. Or perhaps I should simply add /* See the comments in __put_page_tail() */ into the new helper. OK. But I'll wait for other reviews first. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2013-12-16 20:19 ` Andrea Arcangeli 2013-12-16 20:46 ` Oleg Nesterov @ 2013-12-19 19:08 ` Oleg Nesterov 2013-12-19 19:09 ` [PATCH 1/1] " Oleg Nesterov 2013-12-20 14:19 ` [PATCH 0/1] " Martin Schwidefsky 1 sibling, 2 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-19 19:08 UTC (permalink / raw) To: Andrea Arcangeli Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Martin Schwidefsky, Heiko Carstens On 12/16, Andrea Arcangeli wrote: > > Can you reorder set_page_refcount in your v2? Please see the patch. > I wonder if arch_alloc_page needs refcount 1, it sets the page as > stable on s390. Obviously I have no idea what set_page_stable() does, but it works with page_to_phys(), unlikely the content of "struct page" can matter. And only s390 HAVE_ARCH_ALLOC_PAGE, I added Martin and Heiko. > the other way around is to move prep_compound_page > before set_page_refcounted (though I think if we can, keeping the > refcounted at the very last with a comment is preferable). Yes, yes, this looks much more natural. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2013-12-19 19:08 ` [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race Oleg Nesterov @ 2013-12-19 19:09 ` Oleg Nesterov 2013-12-23 11:43 ` Andrea Arcangeli 2013-12-20 14:19 ` [PATCH 0/1] " Martin Schwidefsky 1 sibling, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2013-12-19 19:09 UTC (permalink / raw) To: Andrea Arcangeli Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Martin Schwidefsky, Heiko Carstens get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + compound_lock(). In theory this page_head can be already freed and reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case get_page_unless_zero() can succeed right after set_page_refcounted(), and compound_lock() can race with the non-atomic __SetPageHead(). Perhaps we should rework the thp locking (under discussion), but until then this patch moves set_page_refcounted() and adds wmb() to ensure that page->_count != 0 comes as a last change. I am not sure about other callers of set_page_refcounted(), but at first glance they look fine to me. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/page_alloc.c | 10 ++++++++-- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 115b23b..9402337 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -865,8 +865,6 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) } set_page_private(page, 0); - set_page_refcounted(page); - arch_alloc_page(page, order); kernel_map_pages(page, 1 << order, 1); @@ -876,6 +874,14 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) if (order && (gfp_flags & __GFP_COMP)) prep_compound_page(page, order); + /* + * Make sure the caller of get_page_unless_zero() will see the + * fully initialized page. Say, to ensure that compound_lock() + * can't race with the non-atomic __SetPage*() above. + */ + smp_wmb(); + set_page_refcounted(page); + return 0; } -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: [PATCH 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2013-12-19 19:09 ` [PATCH 1/1] " Oleg Nesterov @ 2013-12-23 11:43 ` Andrea Arcangeli 2014-01-03 19:55 ` [PATCH v2 0/1] " Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Andrea Arcangeli @ 2013-12-23 11:43 UTC (permalink / raw) To: Oleg Nesterov Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Martin Schwidefsky, Heiko Carstens On Thu, Dec 19, 2013 at 08:09:20PM +0100, Oleg Nesterov wrote: > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + > compound_lock(). In theory this page_head can be already freed and > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case > get_page_unless_zero() can succeed right after set_page_refcounted(), > and compound_lock() can race with the non-atomic __SetPageHead(). > > Perhaps we should rework the thp locking (under discussion), but > until then this patch moves set_page_refcounted() and adds wmb() > to ensure that page->_count != 0 comes as a last change. > > I am not sure about other callers of set_page_refcounted(), but at > first glance they look fine to me. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Only one improvement possible, the smp_wmb() could have been put under CONFIG_TRANSPARENT_HUGEPAGE somehow. No difference for x86-64 though. Thanks, Andrea ^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v2 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2013-12-23 11:43 ` Andrea Arcangeli @ 2014-01-03 19:55 ` Oleg Nesterov 2014-01-03 19:55 ` [PATCH v2 1/1] " Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2014-01-03 19:55 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Martin Schwidefsky, Heiko Carstens On 12/23, Andrea Arcangeli wrote: > > Acked-by: Andrea Arcangeli <aarcange@redhat.com> > > Only one improvement possible, the smp_wmb() could have been put under > CONFIG_TRANSPARENT_HUGEPAGE somehow. No difference for x86-64 though. I thought it would be better to serialize it with get_page_unless_zero() in general, but OK, As the changelog says I didn't find another problematic caller, so please see v2. I preserved your ack, thanks. And I would like to know your opinion about other changes I sent ;) Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-03 19:55 ` [PATCH v2 0/1] " Oleg Nesterov @ 2014-01-03 19:55 ` Oleg Nesterov 2014-01-03 21:00 ` Andrew Morton 0 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2014-01-03 19:55 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Martin Schwidefsky, Heiko Carstens get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + compound_lock(). In theory this page_head can be already freed and reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case get_page_unless_zero() can succeed right after set_page_refcounted(), and compound_lock() can race with the non-atomic __SetPageHead(). Perhaps we should rework the thp locking (under discussion), but until then this patch moves set_page_refcounted() and adds wmb() to ensure that page->_count != 0 comes as a last change. I am not sure about other callers of set_page_refcounted(), but at first glance they look fine to me. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> --- mm/page_alloc.c | 12 ++++++++++-- 1 files changed, 10 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 115b23b..d323102 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -865,8 +865,6 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) } set_page_private(page, 0); - set_page_refcounted(page); - arch_alloc_page(page, order); kernel_map_pages(page, 1 << order, 1); @@ -876,6 +874,16 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) if (order && (gfp_flags & __GFP_COMP)) prep_compound_page(page, order); + /* + * Make sure the caller of get_page_unless_zero() will see the + * fully initialized page. Say, to ensure that compound_lock() + * can't race with the non-atomic __SetPage*() above. + */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + smp_wmb(); +#endif + set_page_refcounted(page); + return 0; } -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-03 19:55 ` [PATCH v2 1/1] " Oleg Nesterov @ 2014-01-03 21:00 ` Andrew Morton 2014-01-04 16:43 ` Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Andrew Morton @ 2014-01-03 21:00 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Martin Schwidefsky, Heiko Carstens On Fri, 3 Jan 2014 20:55:47 +0100 Oleg Nesterov <oleg@redhat.com> wrote: > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + > compound_lock(). In theory this page_head can be already freed and > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case > get_page_unless_zero() can succeed right after set_page_refcounted(), > and compound_lock() can race with the non-atomic __SetPageHead(). Would be useful to mention that these things are happening inside prep_compound_opage() (yes?). > Perhaps we should rework the thp locking (under discussion), but > until then this patch moves set_page_refcounted() and adds wmb() > to ensure that page->_count != 0 comes as a last change. > > I am not sure about other callers of set_page_refcounted(), but at > first glance they look fine to me. I don't get it. We're in prep_new_page() - this page is freshly allocated and no other thread yet has any means by which to look it up and start fiddling with it? ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-03 21:00 ` Andrew Morton @ 2014-01-04 16:43 ` Oleg Nesterov 2014-01-08 11:54 ` Mel Gorman 0 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2014-01-04 16:43 UTC (permalink / raw) To: Andrew Morton Cc: Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Martin Schwidefsky, Heiko Carstens On 01/03, Andrew Morton wrote: > > On Fri, 3 Jan 2014 20:55:47 +0100 Oleg Nesterov <oleg@redhat.com> wrote: > > > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + > > compound_lock(). In theory this page_head can be already freed and > > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case > > get_page_unless_zero() can succeed right after set_page_refcounted(), > > and compound_lock() can race with the non-atomic __SetPageHead(). > > Would be useful to mention that these things are happening inside > prep_compound_opage() (yes?). Agreed. Added "in prep_compound_opage()" into the changelog: get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + compound_lock(). In theory this page_head can be already freed and reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case get_page_unless_zero() can succeed right after set_page_refcounted(), and compound_lock() can race with the non-atomic __SetPageHead() in prep_compound_page(). Perhaps we should rework the thp locking (under discussion), but until then this patch moves set_page_refcounted() and adds wmb() to ensure that page->_count != 0 comes as a last change. I am not sure about other callers of set_page_refcounted(), but at first glance they look fine to me. or should I send v3? > > Perhaps we should rework the thp locking (under discussion), but > > until then this patch moves set_page_refcounted() and adds wmb() > > to ensure that page->_count != 0 comes as a last change. > > > > I am not sure about other callers of set_page_refcounted(), but at > > first glance they look fine to me. > > I don't get it. We're in prep_new_page() - this page is freshly > allocated and no other thread yet has any means by which to look it up > and start fiddling with it? Yes, but thp can access this page_head via stale pointer, tail->first_page, if it races with split_huge_page_refcount(). In this case we rely on compound_lock() to detect this race, the problem is that compound_lock() itself can race with head_page->flags manipulations. For example, __get_page_tail() roughly does: // PageTail(page) was already checked page_head = page->first_page; /* WINDOW */ get_page_unless_zero(page_head); compound_lock(page_head); recheck PageTail(page) to ensure page_head is still valid However, in the WINDOW above, split_huge_page() can split this huge page. After that its head can be freed and reallocated. Of course, I don't think it is possible to hit this race in practice, but still this looks wrong. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-04 16:43 ` Oleg Nesterov @ 2014-01-08 11:54 ` Mel Gorman 2014-01-08 13:14 ` Mel Gorman 2014-01-08 16:13 ` Oleg Nesterov 0 siblings, 2 replies; 114+ messages in thread From: Mel Gorman @ 2014-01-08 11:54 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote: > On 01/03, Andrew Morton wrote: > > > > On Fri, 3 Jan 2014 20:55:47 +0100 Oleg Nesterov <oleg@redhat.com> wrote: > > > > > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + > > > compound_lock(). In theory this page_head can be already freed and > > > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case > > > get_page_unless_zero() can succeed right after set_page_refcounted(), > > > and compound_lock() can race with the non-atomic __SetPageHead(). > > > > Would be useful to mention that these things are happening inside > > prep_compound_opage() (yes?). > > Agreed. Added "in prep_compound_opage()" into the changelog: > > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + > compound_lock(). In theory this page_head can be already freed and > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case > get_page_unless_zero() can succeed right after set_page_refcounted(), > and compound_lock() can race with the non-atomic __SetPageHead() in > prep_compound_page(). > > Perhaps we should rework the thp locking (under discussion), but > until then this patch moves set_page_refcounted() and adds wmb() > to ensure that page->_count != 0 comes as a last change. > > I am not sure about other callers of set_page_refcounted(), but at > first glance they look fine to me. > > or should I send v3? > This patch is putting a write barrier in the page allocator fast path and that is going to be a leading cause of Sad Face. We already have seen large regressions before when write barriers were introduced to the page allocator paths for cpusets. Sticking it under CONFIG_TRANSPARENT_HUGEPAGE does not really address the issue. > > > Perhaps we should rework the thp locking (under discussion), but > > > until then this patch moves set_page_refcounted() and adds wmb() > > > to ensure that page->_count != 0 comes as a last change. > > > > > > I am not sure about other callers of set_page_refcounted(), but at > > > first glance they look fine to me. > > > > I don't get it. We're in prep_new_page() - this page is freshly > > allocated and no other thread yet has any means by which to look it up > > and start fiddling with it? > > Yes, but thp can access this page_head via stale pointer, tail->first_page, > if it races with split_huge_page_refcount(). To justify the introduction of a performance regression we need to be 100% sure this race actually exists and not just theoretical. I had lost track of this thread but did not see a description of how this bug can actually happen. At the risk of sounding stupid -- what are the actual circumstances when this race can occur? For example, in the reclaim paths, we are going to be dealing with the head pages as they are on the LRU. They get split into base pages and then the compound handling becomes irrelevant. I cannot see problems there. For futex, the THP page (and the tail) must have been discovered via the page tables in which case the page tables are temporarily preventing the page being freed to the allocator. GUP fast paths are protected from parallel teardowns and the slow paths are protected from parallel unmaps and frees by mmap_sem. Compaction and other walkers mostly deal with the head and/or deal with pages on the LRU where there will be an elevated reference count. The changelog needs a specific example where the problem can occur and even then we should consider if there is an option other than smacking the page allocator. > In this case we rely on > compound_lock() to detect this race, the problem is that compound_lock() > itself can race with head_page->flags manipulations. > > For example, __get_page_tail() roughly does: > > // PageTail(page) was already checked > > page_head = page->first_page; > > /* WINDOW */ > > get_page_unless_zero(page_head); > > compound_lock(page_head); > > recheck PageTail(page) to ensure page_head is still valid > > However, in the WINDOW above, split_huge_page() can split this huge page. > After that its head can be freed and reallocated. Of course, I don't think > it is possible to hit this race in practice, but still this looks wrong. > I can't think of a reason why we would actually hit that race in practice but I might have tunnel vision due to disliking that barrier so much. Unless there is an example with no other possible solution, I do not think we should stick a write barrier into the page allocator fast path. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-08 11:54 ` Mel Gorman @ 2014-01-08 13:14 ` Mel Gorman 2014-01-08 16:13 ` Oleg Nesterov 1 sibling, 0 replies; 114+ messages in thread From: Mel Gorman @ 2014-01-08 13:14 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On Wed, Jan 08, 2014 at 11:54:00AM +0000, Mel Gorman wrote: > On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote: > > On 01/03, Andrew Morton wrote: > > > > > > On Fri, 3 Jan 2014 20:55:47 +0100 Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + > > > > compound_lock(). In theory this page_head can be already freed and > > > > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case > > > > get_page_unless_zero() can succeed right after set_page_refcounted(), > > > > and compound_lock() can race with the non-atomic __SetPageHead(). > > > > > > Would be useful to mention that these things are happening inside > > > prep_compound_opage() (yes?). > > > > Agreed. Added "in prep_compound_opage()" into the changelog: > > > > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + > > compound_lock(). In theory this page_head can be already freed and > > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case > > get_page_unless_zero() can succeed right after set_page_refcounted(), > > and compound_lock() can race with the non-atomic __SetPageHead() in > > prep_compound_page(). > > > > Perhaps we should rework the thp locking (under discussion), but > > until then this patch moves set_page_refcounted() and adds wmb() > > to ensure that page->_count != 0 comes as a last change. > > > > I am not sure about other callers of set_page_refcounted(), but at > > first glance they look fine to me. > > > > or should I send v3? > > > > This patch is putting a write barrier in the page allocator fast path and > that is going to be a leading cause of Sad Face. Peter Zijlstra correctly pointed out to me that on x86 that we generally would not care/notice a write barrier as it almost always is a no-op. X86 (which is all I test any more) can execute an sfence for a smp_wmb but not in any configuration that matters. The previous barrier damage in page_alloc.c was due to full barriers but I generally assume barriers have a cost in core code when I see them regardless of the underlying architecture details. So 99% of the time, we will not care and I won't be making Sad Face but eventually someone using an affected architecture will whinge -- ppc64 probably as write barriers on sparc are compile barriers. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-08 11:54 ` Mel Gorman 2014-01-08 13:14 ` Mel Gorman @ 2014-01-08 16:13 ` Oleg Nesterov 2014-01-08 18:02 ` Mel Gorman 1 sibling, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2014-01-08 16:13 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On 01/08, Mel Gorman wrote: > > On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote: > > > > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + > > compound_lock(). In theory this page_head can be already freed and > > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case > > get_page_unless_zero() can succeed right after set_page_refcounted(), > > and compound_lock() can race with the non-atomic __SetPageHead() in > > prep_compound_page(). > > > This patch is putting a write barrier in the page allocator fast path and > that is going to be a leading cause of Sad Face. We already have seen > large regressions before when write barriers were introduced to the page > allocator paths for cpusets. Sticking it under CONFIG_TRANSPARENT_HUGEPAGE > does not really address the issue. As you already mentioned in another email, smp_wmb() is mostly nop. On x86_64 at least. Although perhaps it would be nice to have static inline void atomic_store_release(atomic_t *v, int i) { smp_store_release(&v->counter, i); } > > Yes, but thp can access this page_head via stale pointer, tail->first_page, > > if it races with split_huge_page_refcount(). > > To justify the introduction of a performance regression we need to be 100% > sure this race actually exists See below. But let me remind that I never looked at this code before, I can be easily wrong. > and not just theoretical. It is theoretical anyway, I guess. > For futex, the THP page (and the tail) must have been discovered via > the page tables in which case the page tables are temporarily preventing > the page being freed to the allocator. Yes. But, for example, get_futex_key() does if (unlikely(PageTail(page))) { put_page(page); why this put_page() can't race with _split? If nothing else, another thread can unmap the part of this vma. > > For example, __get_page_tail() roughly does: > > > > // PageTail(page) was already checked > > > > page_head = page->first_page; > > > > /* WINDOW */ > > > > get_page_unless_zero(page_head); > > > > compound_lock(page_head); > > > > recheck PageTail(page) to ensure page_head is still valid > > > > However, in the WINDOW above, split_huge_page() can split this huge page. > > After that its head can be freed and reallocated. Of course, I don't think > > it is possible to hit this race in practice, but still this looks wrong. > > > > I can't think of a reason why we would actually hit that race in practice Agreed, the window is tiny, unlikely this possible. > I do not think we > should stick a write barrier into the page allocator fast path. OK, I won't argue, I leave this to you and Andrea. But I still think this code needs other cleanups/simplifications. In particular get_futex_key()->__get_user_pages_fast() should die imho. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-08 16:13 ` Oleg Nesterov @ 2014-01-08 18:02 ` Mel Gorman 2014-01-08 19:04 ` Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Mel Gorman @ 2014-01-08 18:02 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On Wed, Jan 08, 2014 at 05:13:38PM +0100, Oleg Nesterov wrote: > On 01/08, Mel Gorman wrote: > > > > On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote: > > > > > > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) + > > > compound_lock(). In theory this page_head can be already freed and > > > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case > > > get_page_unless_zero() can succeed right after set_page_refcounted(), > > > and compound_lock() can race with the non-atomic __SetPageHead() in > > > prep_compound_page(). > > > > > This patch is putting a write barrier in the page allocator fast path and > > that is going to be a leading cause of Sad Face. We already have seen > > large regressions before when write barriers were introduced to the page > > allocator paths for cpusets. Sticking it under CONFIG_TRANSPARENT_HUGEPAGE > > does not really address the issue. > > As you already mentioned in another email, smp_wmb() is mostly nop. On > x86_64 at least. Which sometimes means that it'll just take longer for someone to find it and bitch about it. > Although perhaps it would be nice to have > > static inline void atomic_store_release(atomic_t *v, int i) > { > smp_store_release(&v->counter, i); > } > > > > Yes, but thp can access this page_head via stale pointer, tail->first_page, > > > if it races with split_huge_page_refcount(). > > > > To justify the introduction of a performance regression we need to be 100% > > sure this race actually exists > > See below. But let me remind that I never looked at this code before, > I can be easily wrong. > > > and not just theoretical. > > It is theoretical anyway, I guess. > > > For futex, the THP page (and the tail) must have been discovered via > > the page tables in which case the page tables are temporarily preventing > > the page being freed to the allocator. > > Yes. But, for example, get_futex_key() does > > if (unlikely(PageTail(page))) { > put_page(page); > > why this put_page() can't race with _split? If nothing else, another thread > can unmap the part of this vma. > The race is not prevented but that does not mean it matters. Basic scenario where a split starts after the PageTail check but before the put_page in get_futex_key CPU A get_futex_key -> fast gup, page table removing prevents parallel unmap and free -> gup_huge_pmd (arch/x86/mm/gup.c at least) -> get_huge_page_tail (increment page tail _map_count) -> get_huge_page_multiple (increment ref on head page) -> Check PageTail CPU B split_huge_page_to_list -> split_huge_page_refcount spin_lock_irq(lru_lock) compound_lock -> put_page(tail_page) ->put_compound_page looks up head page takes reference unless zero compound_lock (block) complete split compound_unlock check PageTail This put_page blocks on the compound lock, finds the page is no longer a PageTail as the split barriers correctly and backs out gracefully. So sure, splits can race but the case is cared for. The parallel unmap is prevented by get_huge_page_multiple in the gup path and held in place until put_page_compound frees it later. I still don't see the case where a page gets freed to the page allocator that causes weird problems here. Unfortunately, I also recognise I have tunnel vision because subconsciously I don't *want* to see a problem here that justifies adding a write barrier. Andrea may stomp all over this reasoning in which case we'll get a good comment for the smp_wmb :/ -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-08 18:02 ` Mel Gorman @ 2014-01-08 19:04 ` Oleg Nesterov 2014-01-09 11:27 ` Mel Gorman 0 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2014-01-08 19:04 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On 01/08, Mel Gorman wrote: > > On Wed, Jan 08, 2014 at 05:13:38PM +0100, Oleg Nesterov wrote: > > > > Yes. But, for example, get_futex_key() does > > > > if (unlikely(PageTail(page))) { > > put_page(page); > > > > why this put_page() can't race with _split? If nothing else, another thread > > can unmap the part of this vma. > > > > The race is not prevented but that does not mean it matters. Basic > scenario where a split starts after the PageTail check but before the > put_page in get_futex_key > > CPU A > get_futex_key > -> fast gup, page table removing prevents parallel unmap and free > -> gup_huge_pmd (arch/x86/mm/gup.c at least) > -> get_huge_page_tail (increment page tail _map_count) > -> get_huge_page_multiple (increment ref on head page) > -> Check PageTail > CPU B > split_huge_page_to_list > -> split_huge_page_refcount > spin_lock_irq(lru_lock) > compound_lock > -> put_page(tail_page) > ->put_compound_page > looks up head page Yes. But suppose that CPU B completes split_huge_page_to_list/munmap/etc and frees this head page. > takes reference unless zero suppose this page_head was reallocated and get_page_unless_zero() succeds right after set_page_refcounted(), > compound_lock (block) > complete split > compound_unlock > check PageTail > > This put_page blocks on the compound lock, finds the page is no longer a > PageTail Sure. The problem is that compound_lock() itself can race with prep_new_page() or I missed something. > The parallel unmap is prevented by get_huge_page_multiple in the gup path > and held in place until put_page_compound frees it later. Again, I can easily miss something. And yes, page_tail returned by gup has a reference to its page_head (via page_head->_count). But __split_huge_page_refcount() destroys this connection and decrements page_head->_count. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-08 19:04 ` Oleg Nesterov @ 2014-01-09 11:27 ` Mel Gorman 2014-01-09 14:04 ` Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Mel Gorman @ 2014-01-09 11:27 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On Wed, Jan 08, 2014 at 08:04:43PM +0100, Oleg Nesterov wrote: > On 01/08, Mel Gorman wrote: > > > > On Wed, Jan 08, 2014 at 05:13:38PM +0100, Oleg Nesterov wrote: > > > > > > Yes. But, for example, get_futex_key() does > > > > > > if (unlikely(PageTail(page))) { > > > put_page(page); > > > > > > why this put_page() can't race with _split? If nothing else, another thread > > > can unmap the part of this vma. > > > > > > > The race is not prevented but that does not mean it matters. Basic > > scenario where a split starts after the PageTail check but before the > > put_page in get_futex_key > > > > CPU A > > get_futex_key > > -> fast gup, page table removing prevents parallel unmap and free > > -> gup_huge_pmd (arch/x86/mm/gup.c at least) > > -> get_huge_page_tail (increment page tail _map_count) > > -> get_huge_page_multiple (increment ref on head page) > > -> Check PageTail > > CPU B > > split_huge_page_to_list > > -> split_huge_page_refcount > > spin_lock_irq(lru_lock) > > compound_lock > > -> put_page(tail_page) > > ->put_compound_page > > looks up head page > > Yes. > > But suppose that CPU B completes split_huge_page_to_list/munmap/etc > and frees this head page. > Where did the reference taken by get_huge_page_multiple go? CPU A static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { .... do { ... if (PageTail(page)) /* Increment page->_mapcount */ get_huge_page_tail(page); ... refs++; } while (...) get_head_page_multiple(head, refs); } CPU A in get_futex_key has taken multiple references to the head page, one for every base page on the huge page Now the splitter comes along which does a bunch of stuff but the important part is in __split_huge_page_refcount() static void __split_huge_page_refcount(struct page *page, struct list_head *list) { ... spin_lock_irq(&zone->lru_lock); compound_lock(page); for_every_tail_page() { /* This picks up refcounts from GUP get_huge_page_tail */ tail_count += page_mapcount(page_tail); /* Propogate all mapcounts to the "real" refcount in the tail page */ atomic_add(page_mapcount(head) + page_mapcount(tail), tail->_count) .... flag reinits with barriers ... } atomic_sub(tail_count, headpage->_count); ... unlock stuff } The refcounts on page->_mapcount taken while the page was huge is propogated to the tail pages so it's still pinned in place. > > takes reference unless zero > > suppose this page_head was reallocated and get_page_unless_zero() > succeds right after set_page_refcounted(), > You're right. The head page can still be freed and reallocated as a *smaller* compound page but futex.c is doing the reference count on the tail page that should have an elevated count even after the split #ifdef CONFIG_TRANSPARENT_HUGEPAGE page_head = page; if (unlikely(PageTail(page))) { put_page(page); so I'm still not seeing how a tail page racing with a split ends up with mayhem. > > compound_lock (block) > > complete split > > compound_unlock > > check PageTail > > > > This put_page blocks on the compound lock, finds the page is no longer a > > PageTail > > Sure. The problem is that compound_lock() itself can race with prep_new_page() > or I missed something. > I could also still be stuck in a "la la la, everything is fine" mode. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-09 11:27 ` Mel Gorman @ 2014-01-09 14:04 ` Oleg Nesterov 2014-01-09 18:52 ` Andrea Arcangeli 2014-01-10 16:12 ` Mel Gorman 0 siblings, 2 replies; 114+ messages in thread From: Oleg Nesterov @ 2014-01-09 14:04 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On 01/09, Mel Gorman wrote: > > On Wed, Jan 08, 2014 at 08:04:43PM +0100, Oleg Nesterov wrote: > > > > Yes. > > > > But suppose that CPU B completes split_huge_page_to_list/munmap/etc > > and frees this head page. > > > > Where did the reference taken by get_huge_page_multiple go? In __split_huge_page_refcount(), see below. > CPU A > static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr, > unsigned long end, int write, struct page **pages, int *nr) > { > .... > do { > ... > if (PageTail(page)) > /* Increment page->_mapcount */ > get_huge_page_tail(page); > ... > refs++; > } while (...) > get_head_page_multiple(head, refs); > } > > CPU A in get_futex_key has taken multiple references to the head page, > one for every base page on the huge page Not sure I understand "multiple references to the head page" above... I mean, in this particular case case nr == 1. IOW, If gup returns a tail page, this page_tail has 1 in ->_mapcount (to simplify) and its ->first_page gets the additional 1 in ->_count. > static void __split_huge_page_refcount(struct page *page, > struct list_head *list) > { > ... > spin_lock_irq(&zone->lru_lock); > compound_lock(page); > > for_every_tail_page() { > /* This picks up refcounts from GUP get_huge_page_tail */ > tail_count += page_mapcount(page_tail); > > /* Propogate all mapcounts to the "real" refcount in the tail page */ > atomic_add(page_mapcount(head) + page_mapcount(tail), tail->_count) > > .... flag reinits with barriers ... > } > atomic_sub(tail_count, headpage->_count); > ... > unlock stuff > } > > The refcounts on page->_mapcount taken while the page was huge is > propogated to the tail pages so it's still pinned in place. Yes. But at the same time atomic_sub(tail_count, headpage->_count) above reverts the result of get_head_page_multiple(head) done by gup() above. IOW, after __split_huge_page_refcount() page_tail no longer pins its former page_head. > > > takes reference unless zero > > > > suppose this page_head was reallocated and get_page_unless_zero() > > succeds right after set_page_refcounted(), > > > > You're right. The head page can still be freed and reallocated as a *smaller* > compound page but futex.c is doing the reference count on the tail page > that should have an elevated count even after the split Yes, page_tail can't go away, the reference was moved to page_tail->_count. > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > page_head = page; > if (unlikely(PageTail(page))) { > put_page(page); > > > so I'm still not seeing how a tail page racing with a split ends up with > mayhem. But get/put(page_tail) plays with page_head which can be freed/reallocated, it does compound_lock(page_head). > I could also still be stuck in a "la la la, everything is fine" mode. More likely it is me who tries to deny the fact I missed something ;) But let me try again. Lets ignore PageSlab/PageHeadHuge. put_compound_page() is complicated, but roughly it does: CPU 0 CPU 1 if (!PageTail(page_tail)) return; page_head = page_tail->first_page; unmap this vma, free everything. page_tail can't go away, its ->_count was incremented by __split_huge_page_refcount(). alloc_pages(GFP_COMP, 1) reallocates page_head, prep_new_page() does set_page_refcounted(page_head), // succeeds after set_page_refcounted() get_page_unless_zero(page_head); compound_lock(page_head) prep_compound_page(page_head); Now, both compound_lock() and prep_compound_page() play with the same page_head->flags, but __SetPageHead(page_head) is non-atomic. OK. Even if I am right, we can probably make another fix. put_compound_page() and __get_page_tail() can do yet another PageTail() check _before_ compound_lock(). Although personally I'd prefer this patch. And if we change get/put I think it would be better to do this on top of "[PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head()" http://marc.info/?l=linux-kernel&m=138739438800899 Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-09 14:04 ` Oleg Nesterov @ 2014-01-09 18:52 ` Andrea Arcangeli 2014-01-09 19:43 ` Oleg Nesterov 2014-01-10 16:12 ` Mel Gorman 1 sibling, 1 reply; 114+ messages in thread From: Andrea Arcangeli @ 2014-01-09 18:52 UTC (permalink / raw) To: Oleg Nesterov Cc: Mel Gorman, Andrew Morton, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On Thu, Jan 09, 2014 at 03:04:47PM +0100, Oleg Nesterov wrote: > OK. Even if I am right, we can probably make another fix. I think the confusion here was to think this was related to the futex code, it isn't. This was just a generic theoretical problem found doing the futex cleanups but it's not related to the futex code. > put_compound_page() and __get_page_tail() can do yet another PageTail() > check _before_ compound_lock(). The above alternate fix looks good to me too. Only thing to sort out is in the common code (not just x86) then we may need a smp_mb() between PageTail check and the bit_spin_lock... We just can't risk writing the bit_spin_lock before reading PageTail. Not sure if the branch that skips the bit_spin_lock helps on some arch and we can depend on that but I doubt. smp_mb and smp_rmb are not noops like smp_wmb on x86 so the other patch would be more efficient on x86 if we have to add a smp_mb() before compound_lock(), but then the put/get_page slow path is only taken when releasing or getting tail pages after gup_fast. And regardless of gup_fast, like Linus said, for increased NUMA fairness we could move the compound lock from page->flags to an hashed array of proper spinlocks sized in function of ram. The contention on these locks is so low that I doubt we can run into lock starvation, but because the contention is so low, the array would be fine as well, and it would be more theoretically correct for NUMA usages than the bit spinlock. So this problem also goes away if we convert the bit_spin_lock to an hashed array of spin_lock. So in the longer term it doesn't matter how we fix it now, and we should document it in the fix that the additional PageTail check should be dropped after converting the bit_spin_lock to an array of hashed spinlocks (or the smp_wmb() if we go with the previous fix). I personally prefer to keep the complexity in one place so adding to get/put_page but the other way is a bit more efficient for x86 maybe (until we have smp_mb__before_spin_lock/bit_spin_lock at least). > Although personally I'd prefer this patch. And if we change get/put > I think it would be better to do this on top of > > "[PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head()" > http://marc.info/?l=linux-kernel&m=138739438800899 Not against the cleanups of course, but about the order, it gets harder to backport it for distros if applied after the cleanups. Thanks! Andrea ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-09 18:52 ` Andrea Arcangeli @ 2014-01-09 19:43 ` Oleg Nesterov 2014-01-09 21:36 ` Andrea Arcangeli 0 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2014-01-09 19:43 UTC (permalink / raw) To: Andrea Arcangeli Cc: Mel Gorman, Andrew Morton, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On 01/09, Andrea Arcangeli wrote: > > On Thu, Jan 09, 2014 at 03:04:47PM +0100, Oleg Nesterov wrote: > > OK. Even if I am right, we can probably make another fix. > > I think the confusion here was to think this was related to the futex > code, it isn't. This was just a generic theoretical problem found > doing the futex cleanups but it's not related to the futex code. Yes, yes, sure. I mentioned get_futex_key() just for example. > > put_compound_page() and __get_page_tail() can do yet another PageTail() > > check _before_ compound_lock(). > > The above alternate fix looks good to me too. > > Only thing to sort out is in the common code (not just x86) then we > may need a smp_mb() between PageTail check and the bit_spin_lock... We > just can't risk writing the bit_spin_lock before reading PageTail. I do not think we need mb() in between... other callers of compound_lock() look fine, get/put(page_tail) can't have the false positive after successful get_page_unless_zero(), and recently it was documented that the kernel can rely on the control dependency to serialize LOAD + STORE. But we probably need barrier() in between, we can't use ACCESS_ONCE(). > And regardless of gup_fast, like Linus said, for increased NUMA > fairness we could move the compound lock from page->flags to an hashed > array of proper spinlocks sized in function of ram. The contention on > these locks is so low that I doubt we can run into lock starvation, > but because the contention is so low, the array would be fine as well, > and it would be more theoretically correct for NUMA usages than the > bit spinlock. So this problem also goes away if we convert the > bit_spin_lock to an hashed array of spin_lock. Yes. But in this case I really think we should cleanup get/put first and add the helper, like the patch I mentioned does. > I personally prefer to keep the complexity in one place so adding to > get/put_page OK. I'll send v3. > > Although personally I'd prefer this patch. And if we change get/put > > I think it would be better to do this on top of > > > > "[PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head()" > > http://marc.info/?l=linux-kernel&m=138739438800899 > > Not against the cleanups of course, but about the order, it gets > harder to backport it for distros if applied after the cleanups. Oh, I don't think this highly theoreitical fix should be backported but I agree, lets fix the bug first. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-09 19:43 ` Oleg Nesterov @ 2014-01-09 21:36 ` Andrea Arcangeli 2014-01-10 16:12 ` Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Andrea Arcangeli @ 2014-01-09 21:36 UTC (permalink / raw) To: Oleg Nesterov Cc: Mel Gorman, Andrew Morton, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On Thu, Jan 09, 2014 at 08:43:50PM +0100, Oleg Nesterov wrote: > get_page_unless_zero(), and recently it was documented that the kernel can > rely on the control dependency to serialize LOAD + STORE. Ok, that's cheap then. > > But we probably need barrier() in between, we can't use ACCESS_ONCE(). After get_page_unless_zero I don't think there's any need of barrier(). barrier() should have been implicit in __atomic_add_unless in fact it should be a full smp_mb() equivalent too. Memory is always clobbered there and the asm is volatile. My wondering was only about the runtime (not compiler) barrier after running PageTail and before compound_lock, because bit_spin_lock has only acquire semantics so in absence of the branch that bails out the lock, the spinlock could run before PageTail. If the branch is good enough guarantee for all archs it's good and cheap solution. Clearly this is not an x86 concern, which is always fine without anything when surrounded by locked ops like here. > Yes. But in this case I really think we should cleanup get/put first > and add the helper, like the patch I mentioned does. Ok, up to you, I'll check it. > Oh, I don't think this highly theoreitical fix should be backported > but I agree, lets fix the bug first. I think it should, but the risk free version of it, so either a few liner addition before compound_lock or the previous with smp_wmb() inside the ifdef (considering it only matters on x86 where smp_wmb is zero cost and the only operational change is actually the reordering of the set_page_refcounted not the smp_wmb irrelevant for x86). ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-09 21:36 ` Andrea Arcangeli @ 2014-01-10 16:12 ` Oleg Nesterov 2014-01-10 16:50 ` Peter Zijlstra 0 siblings, 1 reply; 114+ messages in thread From: Oleg Nesterov @ 2014-01-10 16:12 UTC (permalink / raw) To: Andrea Arcangeli, Paul E. McKenney Cc: Mel Gorman, Andrew Morton, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On 01/09, Andrea Arcangeli wrote: > > > > > But we probably need barrier() in between, we can't use ACCESS_ONCE(). > > After get_page_unless_zero I don't think there's any need of > barrier(). barrier() should have been implicit in __atomic_add_unless > in fact it should be a full smp_mb() equivalent too. Memory is always > clobbered there and the asm is volatile. Yes, yes, > My wondering was only about the runtime (not compiler) barrier after > running PageTail and before compound_lock, Yes, this is what I meant. Except I really meant the compiler barrier, although I do not think it is actually needed, test_and_set_bit() implies mb(). > because bit_spin_lock has > only acquire semantics so in absence of the branch that bails out the > lock, the spinlock could run before PageTail. If the branch is good > enough guarantee for all archs it's good and cheap solution. The recent "[PATCH v6 tip/core/locking 3/8] Documentation/memory-barriers.txt: Prohibit speculative writes" from Paul says: No SMP architecture currently supporting Linux allows speculative writes, ... +ACCESS_ONCE(), which preserves the ordering between +the load from variable 'a' and the store to variable 'b': + + q = ACCESS_ONCE(a); + if (q) { + ACCESS_ONCE(b) = p; + do_something(); + } We can't use ACCESS_ONCE(), but I think that if (PageTail(page)) { barrier(); compound_lock(page_head); } should obviously work (even if compound_lock() didn't imply mb). Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-10 16:12 ` Oleg Nesterov @ 2014-01-10 16:50 ` Peter Zijlstra 0 siblings, 0 replies; 114+ messages in thread From: Peter Zijlstra @ 2014-01-10 16:50 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrea Arcangeli, Paul E. McKenney, Mel Gorman, Andrew Morton, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Martin Schwidefsky, Heiko Carstens On Fri, Jan 10, 2014 at 05:12:27PM +0100, Oleg Nesterov wrote: > The recent "[PATCH v6 tip/core/locking 3/8] Documentation/memory-barriers.txt: > Prohibit speculative writes" from Paul says: > > No SMP architecture currently supporting Linux allows speculative writes, > > ... > > +ACCESS_ONCE(), which preserves the ordering between > +the load from variable 'a' and the store to variable 'b': > + > + q = ACCESS_ONCE(a); > + if (q) { > + ACCESS_ONCE(b) = p; > + do_something(); > + } > > > We can't use ACCESS_ONCE(), but I think that > > if (PageTail(page)) { > barrier(); > compound_lock(page_head); > } > > should obviously work (even if compound_lock() didn't imply mb). The compiler can actually screw you over if that's preceded by something like: SetPageTail(page). In which case it can prove that if (PageTail()) is a non-condition. But yes, barring that, the version with barrier() in should stop the compiler from doing most terrible things and it ought to work out. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH v2 1/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2014-01-09 14:04 ` Oleg Nesterov 2014-01-09 18:52 ` Andrea Arcangeli @ 2014-01-10 16:12 ` Mel Gorman 1 sibling, 0 replies; 114+ messages in thread From: Mel Gorman @ 2014-01-10 16:12 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Martin Schwidefsky, Heiko Carstens On Thu, Jan 09, 2014 at 03:04:47PM +0100, Oleg Nesterov wrote: > > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > page_head = page; > > if (unlikely(PageTail(page))) { > > put_page(page); > > > > > > so I'm still not seeing how a tail page racing with a split ends up with > > mayhem. > > But get/put(page_tail) plays with page_head which can be freed/reallocated, > it does compound_lock(page_head). > > > I could also still be stuck in a "la la la, everything is fine" mode. > > More likely it is me who tries to deny the fact I missed something ;) > My hangup was that this was related to futex and I was focusing it as a specific example that made the patch necessary. However, this is a therotical case that potentially impacts a put_page if it mistakenly believes it is still a tail page when it's not due a a parallel split. I see and understand that race and while I think the patch is overkill, I have no problem with including it at the start of a series that reexamines the locking in that area. It makes for a suitable -stable backport and I hope/expect the reworked locking would then remove the barrier again for upstream. I haven't looked at the reworked locking but understand there is a v3 on the way so I'll wait until that happens and work my way through it. Thanks and sorry for the noise. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race 2013-12-19 19:08 ` [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race Oleg Nesterov 2013-12-19 19:09 ` [PATCH 1/1] " Oleg Nesterov @ 2013-12-20 14:19 ` Martin Schwidefsky 1 sibling, 0 replies; 114+ messages in thread From: Martin Schwidefsky @ 2013-12-20 14:19 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrea Arcangeli, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman, Heiko Carstens On Thu, 19 Dec 2013 20:08:46 +0100 Oleg Nesterov <oleg@redhat.com> wrote: > On 12/16, Andrea Arcangeli wrote: > > > > Can you reorder set_page_refcount in your v2? > > Please see the patch. > > > I wonder if arch_alloc_page needs refcount 1, it sets the page as > > stable on s390. > > Obviously I have no idea what set_page_stable() does, but it works > with page_to_phys(), unlikely the content of "struct page" can matter. > And only s390 HAVE_ARCH_ALLOC_PAGE, I added Martin and Heiko. On s390 the arch_alloc_page primitive is used to tell the hipervisor that a page is going to be used. While the page is free it is marked as "unused" which allows the hipervisor to throw away the page content if the page is selected to be swapped. We do have a patch to add the host support for KVM somewhere in our queue. The content of the "struct page" does not matter at all for the set-stable/set-unused state transition, s390 does not care about the refcount in its arch_alloc_page function. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups 2013-12-13 17:34 ` Andrea Arcangeli 2013-12-16 18:36 ` Oleg Nesterov @ 2013-12-16 20:19 ` Oleg Nesterov 2013-12-16 20:19 ` [PATCH -mm 1/2] mm: thp: __get_page_tail_foll() can use get_huge_page_tail() Oleg Nesterov ` (2 more replies) 1 sibling, 3 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-16 20:19 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On top of mm-tail-page-refcounting-optimization-for-slab-and-hugetlbfs.patch should not be applied without the ack from Andrea. On 12/13, Andrea Arcangeli wrote: > > The above diff looks a straightforward cleanup you could submit it as > a separate patch in a v2 series. OK, let me send this separately, because (afaics) put_compound_tail() needs more thinking. See also 2/2. Again, I won't argue if you dislike this change even if it is correct, so please review and ack/nack. To me compound_head() in get_huge_page_tail() looks confusing, as if get_huge_page_tail() can accept a !PageTail page. But perhaps this is only because I am new to this code. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* [PATCH -mm 1/2] mm: thp: __get_page_tail_foll() can use get_huge_page_tail() 2013-12-16 20:19 ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov @ 2013-12-16 20:19 ` Oleg Nesterov 2013-12-16 20:19 ` [PATCH -mm 2/2] mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail() Oleg Nesterov 2013-12-16 20:27 ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Andrea Arcangeli 2 siblings, 0 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-16 20:19 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman Cleanup. Change __get_page_tail_foll() to use get_huge_page_tail() to avoid the code duplication. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/internal.h | 5 +---- 1 files changed, 1 insertions(+), 4 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index a85a3ab..a346ba1 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -47,12 +47,9 @@ static inline void __get_page_tail_foll(struct page *page, * page_cache_get_speculative()) on tail pages. */ VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0); - VM_BUG_ON(atomic_read(&page->_count) != 0); - VM_BUG_ON(page_mapcount(page) < 0); if (get_page_head) atomic_inc(&page->first_page->_count); - if (compound_tail_refcounted(page->first_page)) - atomic_inc(&page->_mapcount); + get_huge_page_tail(page); } /* -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 114+ messages in thread
* [PATCH -mm 2/2] mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail() 2013-12-16 20:19 ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov 2013-12-16 20:19 ` [PATCH -mm 1/2] mm: thp: __get_page_tail_foll() can use get_huge_page_tail() Oleg Nesterov @ 2013-12-16 20:19 ` Oleg Nesterov 2013-12-16 20:27 ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Andrea Arcangeli 2 siblings, 0 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-16 20:19 UTC (permalink / raw) To: Andrea Arcangeli, Andrew Morton Cc: Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman get_huge_page_tail()->compound_head() looks confusing. Every caller must check PageTail(page), otherwise atomic_inc(&page->_mapcount) is simply wrong if this page is compound-trans-head. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- include/linux/mm.h | 7 +++---- 1 files changed, 3 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 13bae9e..13495bd 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -431,13 +431,12 @@ static inline bool compound_tail_refcounted(struct page *page) static inline void get_huge_page_tail(struct page *page) { /* - * __split_huge_page_refcount() cannot run - * from under us. - * In turn no need of compound_trans_head here. + * __split_huge_page_refcount() cannot run from under us. */ + VM_BUG_ON(!PageTail(page)); VM_BUG_ON(page_mapcount(page) < 0); VM_BUG_ON(atomic_read(&page->_count) != 0); - if (compound_tail_refcounted(compound_head(page))) + if (compound_tail_refcounted(page->first_page)) atomic_inc(&page->_mapcount); } -- 1.5.5.1 ^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups 2013-12-16 20:19 ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov 2013-12-16 20:19 ` [PATCH -mm 1/2] mm: thp: __get_page_tail_foll() can use get_huge_page_tail() Oleg Nesterov 2013-12-16 20:19 ` [PATCH -mm 2/2] mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail() Oleg Nesterov @ 2013-12-16 20:27 ` Andrea Arcangeli 2 siblings, 0 replies; 114+ messages in thread From: Andrea Arcangeli @ 2013-12-16 20:27 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Thomas Gleixner, Linus Torvalds, Dave Jones, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Mon, Dec 16, 2013 at 09:19:00PM +0100, Oleg Nesterov wrote: > On top of > > mm-tail-page-refcounting-optimization-for-slab-and-hugetlbfs.patch > > should not be applied without the ack from Andrea. > > On 12/13, Andrea Arcangeli wrote: > > > > The above diff looks a straightforward cleanup you could submit it as > > a separate patch in a v2 series. > > OK, let me send this separately, because (afaics) put_compound_tail() > needs more thinking. > > See also 2/2. Again, I won't argue if you dislike this change even if > it is correct, so please review and ack/nack. To me compound_head() in > get_huge_page_tail() looks confusing, as if get_huge_page_tail() can > accept a !PageTail page. But perhaps this is only because I am new to > this code. compound_head in get_huge_page_tail was just a more readable version of page->first_page in __get_page_tail_foll. But page->first_page is faster so it's better. I reviewed all callers and there's no risk of the VM_BUG_ON triggering but I prefer it too. Both patches Acked. Acked-by: Andrea Arcangeli <aarcange@redhat.com> ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:19 ` Linus Torvalds 2013-12-10 22:33 ` Linus Torvalds @ 2013-12-10 22:42 ` Thomas Gleixner 2013-12-10 22:48 ` Linus Torvalds ` (3 more replies) 1 sibling, 4 replies; 114+ messages in thread From: Thomas Gleixner @ 2013-12-10 22:42 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, 10 Dec 2013, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 1:57 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > So it looks like __get_user_pages_fast() fails, and keeps failing. > > Hmm.. Is any of the addresses unchecked, perhaps? > __get_user_pages_fast() does an access_ok() check, while > get_user_pages_fast() does *not* seem to do one. > > That looks a bit dangerous. Yeah, users should have checked the > address range, but there really is no reason not to do it in > get_user_pages_fast(). > > And it looks like the futex code is actually seriously buggered. It > only does the access_ok() check for the non-shared case. > > Why? The !fshared case is the fast path which does not even reach get_user_pages_fast(). We had this discussion some time ago already, where the access_ok() check was missing in the !fshared case or the check was buggered for some reason. Need to dig up the gory details. And yes, I remember that we do not do an extra check for the fshared case, because get_user_pages_fast() should do it for us already. If not we are fubared not only in the futex code. But there is a subtle detail: err = get_user_pages_fast(address, 1, 1, &page); So we ask for write access as the write argument is 1. In case that fails we have that fallback path: /* * If write access is not required (eg. FUTEX_WAIT), try * and get read-only access. */ if (err == -EFAULT && rw == VERIFY_READ) { err = get_user_pages_fast(address, 1, 0, &page); That's a legitimate use case. And futex_requeue only requests VERIFY_READ for the !requeue_pi case. Now, if that map is RO, i.e. we took the fallback path then the THP one will fail as it has write=1 unconditionally. if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) Thanks, tglx ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:42 ` process 'stuck' at exit Thomas Gleixner @ 2013-12-10 22:48 ` Linus Torvalds 2013-12-10 22:58 ` Darren Hart ` (3 more replies) 2013-12-10 22:51 ` Darren Hart ` (2 subsequent siblings) 3 siblings, 4 replies; 114+ messages in thread From: Linus Torvalds @ 2013-12-10 22:48 UTC (permalink / raw) To: Thomas Gleixner Cc: Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > And yes, I remember that we do not do an extra check for the fshared > case, because get_user_pages_fast() should do it for us already. If > not we are fubared not only in the futex code. Yeah. It turns out we do do the access check indirectly - by looking at the PAGE_USER bit, even if we don't necessarily check the actual limits. So get_user_pages_fast() is fine. > But there is a subtle detail: Yup, see my email from ten minutes ago, we found the same thing. And that would seem to explain the endless loop, and also the timing (since Dave mentions he started doing large-pages lately). So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing should work. Dave, can you re-create that trinity run and test that patch? I think we've got this, but it might be nice to leave the hung machine up and running until it's verified.. Although I don't really see what else we could need or get out of it, so.. Linus ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:48 ` Linus Torvalds @ 2013-12-10 22:58 ` Darren Hart 2013-12-10 23:01 ` Dave Jones 2013-12-10 23:00 ` Dave Jones ` (2 subsequent siblings) 3 siblings, 1 reply; 114+ messages in thread From: Darren Hart @ 2013-12-10 22:58 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Dave Jones, Oleg Nesterov, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, 2013-12-10 at 14:48 -0800, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > And yes, I remember that we do not do an extra check for the fshared > > case, because get_user_pages_fast() should do it for us already. If > > not we are fubared not only in the futex code. > > Yeah. It turns out we do do the access check indirectly - by looking > at the PAGE_USER bit, even if we don't necessarily check the actual > limits. So get_user_pages_fast() is fine. > > > But there is a subtle detail: > > Yup, see my email from ten minutes ago, we found the same thing. And > that would seem to explain the endless loop, and also the timing > (since Dave mentions he started doing large-pages lately). > > So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing > should work. > > Dave, can you re-create that trinity run and test that patch? I think > we've got this, but it might be nice to leave the hung machine up and > running until it's verified.. Although I don't really see what else we > could need or get out of it, so.. Would it be possible to limit the options to only pass FUTEX_CMP_REQUEUE and a read-only uaddr? That should improve confidence when it doesn't fail :-) -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:58 ` Darren Hart @ 2013-12-10 23:01 ` Dave Jones 0 siblings, 0 replies; 114+ messages in thread From: Dave Jones @ 2013-12-10 23:01 UTC (permalink / raw) To: Darren Hart Cc: Linus Torvalds, Thomas Gleixner, Oleg Nesterov, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 02:58:19PM -0800, Darren Hart wrote: > On Tue, 2013-12-10 at 14:48 -0800, Linus Torvalds wrote: > > On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > And yes, I remember that we do not do an extra check for the fshared > > > case, because get_user_pages_fast() should do it for us already. If > > > not we are fubared not only in the futex code. > > > > Yeah. It turns out we do do the access check indirectly - by looking > > at the PAGE_USER bit, even if we don't necessarily check the actual > > limits. So get_user_pages_fast() is fine. > > > > > But there is a subtle detail: > > > > Yup, see my email from ten minutes ago, we found the same thing. And > > that would seem to explain the endless loop, and also the timing > > (since Dave mentions he started doing large-pages lately). > > > > So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing > > should work. > > > > Dave, can you re-create that trinity run and test that patch? I think > > we've got this, but it might be nice to leave the hung machine up and > > running until it's verified.. Although I don't really see what else we > > could need or get out of it, so.. > > Would it be possible to limit the options to only pass FUTEX_CMP_REQUEUE > and a read-only uaddr? That should improve confidence when it doesn't > fail :-) easy enough to hack into the code yeah. A bit complicated to come up with a sensible grammar for a command line parser for such cases sadly. I'll give the patch a try after dinner. Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:48 ` Linus Torvalds 2013-12-10 22:58 ` Darren Hart @ 2013-12-10 23:00 ` Dave Jones 2013-12-11 0:05 ` Steven Rostedt 2013-12-11 4:09 ` Dave Jones 2013-12-12 4:26 ` Dave Jones 3 siblings, 1 reply; 114+ messages in thread From: Dave Jones @ 2013-12-10 23:00 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > > > > And yes, I remember that we do not do an extra check for the fshared > > case, because get_user_pages_fast() should do it for us already. If > > not we are fubared not only in the futex code. > > Yeah. It turns out we do do the access check indirectly - by looking > at the PAGE_USER bit, even if we don't necessarily check the actual > limits. So get_user_pages_fast() is fine. > > > But there is a subtle detail: > > Yup, see my email from ten minutes ago, we found the same thing. And > that would seem to explain the endless loop, and also the timing > (since Dave mentions he started doing large-pages lately). > > So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing > should work. > > Dave, can you re-create that trinity run and test that patch? I think > we've got this, but it might be nice to leave the hung machine up and > running until it's verified.. Although I don't really see what else we > could need or get out of it, so.. The only thing I'm still unclear on, is how that pid allegedly wasn't doing a futex call as part of its run. The only thing I can think of is that the other pid that _did_ do a futex call did it on a page that was MAP_SHARED between all the other children, and this 'spin forever' thing only happens when the last process with a reference on that page exits ? does that make sense? Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 23:00 ` Dave Jones @ 2013-12-11 0:05 ` Steven Rostedt 2013-12-11 0:23 ` Dave Jones 0 siblings, 1 reply; 114+ messages in thread From: Steven Rostedt @ 2013-12-11 0:05 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Thomas Gleixner, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 06:00:09PM -0500, Dave Jones wrote: > > The only thing I'm still unclear on, is how that pid allegedly wasn't doing > a futex call as part of its run. The only thing I can think of is that > the other pid that _did_ do a futex call did it on a page that was MAP_SHARED > between all the other children, and this 'spin forever' thing only > happens when the last process with a reference on that page exits ? Which thread did not do the futex call? The one that was spinning? No, that one most definitely was, at least according to the stack trace trace you posted: trinity-child27-10818 [001] 89790.703547: kernel_stack: <stack trace> => futex_requeue (ffffffff810df18a) => do_futex (ffffffff810e019e) => SyS_futex (ffffffff810e0de1) => tracesys (ffffffff81760be4) It did a futex() system call. Or are you talking about another thread? -- Steve ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-11 0:05 ` Steven Rostedt @ 2013-12-11 0:23 ` Dave Jones 2013-12-11 0:55 ` Dave Jones 0 siblings, 1 reply; 114+ messages in thread From: Dave Jones @ 2013-12-11 0:23 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Thomas Gleixner, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 07:05:04PM -0500, Steven Rostedt wrote: > On Tue, Dec 10, 2013 at 06:00:09PM -0500, Dave Jones wrote: > > > > The only thing I'm still unclear on, is how that pid allegedly wasn't doing > > a futex call as part of its run. The only thing I can think of is that > > the other pid that _did_ do a futex call did it on a page that was MAP_SHARED > > between all the other children, and this 'spin forever' thing only > > happens when the last process with a reference on that page exits ? > > Which thread did not do the futex call? The one that was spinning? No, that one > most definitely was, at least according to the stack trace trace you posted: > > trinity-child27-10818 [001] 89790.703547: kernel_stack: <stack trace> > => futex_requeue (ffffffff810df18a) > => do_futex (ffffffff810e019e) > => SyS_futex (ffffffff810e0de1) > => tracesys (ffffffff81760be4) > > It did a futex() system call. > > Or are you talking about another thread? It's the same thread. but here's what it says the last thing it did was.. (gdb) print shm->previous_syscallno[27] $1 = 288 accept4. Just to verify I'm looking at the right array member.. (gdb) print shm->pids[27] $2 = 10818 Oh, hmm. Wait, I'm an idiot. I only update ->previous when we come back from the syscall. It's _still_ doing this syscall. (gdb) print shm->syscallno[27] $4 = 202 I was distracted by seeing all the other threads exiting, so I was only looking at what this one had already done. ok, mystery solved. derp. Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-11 0:23 ` Dave Jones @ 2013-12-11 0:55 ` Dave Jones 2013-12-14 20:17 ` Oleg Nesterov 0 siblings, 1 reply; 114+ messages in thread From: Dave Jones @ 2013-12-11 0:55 UTC (permalink / raw) To: Steven Rostedt, Linus Torvalds, Thomas Gleixner, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 07:23:30PM -0500, Dave Jones wrote: > I was distracted by seeing all the other threads exiting, so I was only looking at > what this one had already done. another thing that distracted me was that /proc/10818/stack was just showing that [<ffffffffffffffff>] 0xffffffffffffffff output. For my own education, what causes that ? How come it didn't show the same trace I saw when I sysrq-t'd ? Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-11 0:55 ` Dave Jones @ 2013-12-14 20:17 ` Oleg Nesterov 0 siblings, 0 replies; 114+ messages in thread From: Oleg Nesterov @ 2013-12-14 20:17 UTC (permalink / raw) To: Dave Jones, Steven Rostedt, Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On 12/10, Dave Jones wrote: > > On Tue, Dec 10, 2013 at 07:23:30PM -0500, Dave Jones wrote: > > > I was distracted by seeing all the other threads exiting, so I was only looking at > > what this one had already done. > > another thing that distracted me was that /proc/10818/stack was just showing that > > [<ffffffffffffffff>] 0xffffffffffffffff > > output. > > For my own education, what causes that ? save_stack_trace_tsk() adds ULONG_MAX as the "last" entry. and dump_trace() fails if task is running and != current (note that cat /proc/self/stack works). > How come it didn't show the same trace I saw when I sysrq-t'd ? Because print_trace_address() does not skip !reliable entries, unlike __save_stack_address(). This (afaics) makes the difference. I'll try to make a patch but I am not sure... I am not even sure it makes sense, but in any case this all doesn't look right to me. First of all, stack = task->thread.sp is not really right if this task is running. Worse, bp = *stack returned by stack_frame() is random in this case. This equally applies to sysrq-t's output. Not that bad, but still wrong and confusing, imho. And lets look at dump_trace(), const unsigned cpu = get_cpu(); unsigned long *irq_stack_end = (unsigned long *)per_cpu(irq_stack_ptr, cpu); This (in general) has nothing to do with task_cpu(task). And why dump_trace() checks irq_stack_end != NULL ? This is always true. I think that these paths should not even try to guess what bp is if the task is not running/current. But it is not clear to "disable" reliable check in __save_stack_address(), we should report this fact in proc_pid_stack()->seq_printf() somehow. And proc_pid_stack() should drop ->cred_guard_mutex right after save_stack_trace_tsk(), although this is off-topic. Oleg. ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:48 ` Linus Torvalds 2013-12-10 22:58 ` Darren Hart 2013-12-10 23:00 ` Dave Jones @ 2013-12-11 4:09 ` Dave Jones 2013-12-12 4:26 ` Dave Jones 3 siblings, 0 replies; 114+ messages in thread From: Dave Jones @ 2013-12-11 4:09 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote: > Dave, can you re-create that trinity run and test that patch? Looks ok so far, but I'll leave it run overnight to be sure. Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:48 ` Linus Torvalds ` (2 preceding siblings ...) 2013-12-11 4:09 ` Dave Jones @ 2013-12-12 4:26 ` Dave Jones 2013-12-12 5:29 ` Darren Hart 3 siblings, 1 reply; 114+ messages in thread From: Dave Jones @ 2013-12-12 4:26 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote: > Dave, can you re-create that trinity run and test that patch? I think > we've got this 24 hours later, all is well. I think we can call this one done. Tested-by: Dave Jones <davej@fedoraproject.org> Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-12 4:26 ` Dave Jones @ 2013-12-12 5:29 ` Darren Hart 0 siblings, 0 replies; 114+ messages in thread From: Darren Hart @ 2013-12-12 5:29 UTC (permalink / raw) To: Dave Jones Cc: Linus Torvalds, Thomas Gleixner, Oleg Nesterov, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Wed, 2013-12-11 at 23:26 -0500, Dave Jones wrote: > On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote: > > > Dave, can you re-create that trinity run and test that patch? I think > > we've got this > > 24 hours later, all is well. I think we can call this one done. > > Tested-by: Dave Jones <davej@fedoraproject.org> Thank you again for a fine preemptive bug catch Dave! -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:42 ` process 'stuck' at exit Thomas Gleixner 2013-12-10 22:48 ` Linus Torvalds @ 2013-12-10 22:51 ` Darren Hart 2013-12-10 23:24 ` Al Viro 2013-12-11 16:31 ` Mel Gorman 3 siblings, 0 replies; 114+ messages in thread From: Darren Hart @ 2013-12-10 22:51 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Dave Jones, Oleg Nesterov, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, 2013-12-10 at 23:42 +0100, Thomas Gleixner wrote: > On Tue, 10 Dec 2013, Linus Torvalds wrote: > > > On Tue, Dec 10, 2013 at 1:57 PM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > So it looks like __get_user_pages_fast() fails, and keeps failing. > > > > Hmm.. Is any of the addresses unchecked, perhaps? > > __get_user_pages_fast() does an access_ok() check, while > > get_user_pages_fast() does *not* seem to do one. > > > > That looks a bit dangerous. Yeah, users should have checked the > > address range, but there really is no reason not to do it in > > get_user_pages_fast(). > > > > And it looks like the futex code is actually seriously buggered. It > > only does the access_ok() check for the non-shared case. > > > > Why? > > The !fshared case is the fast path which does not even reach > get_user_pages_fast(). > > We had this discussion some time ago already, where the access_ok() > check was missing in the !fshared case or the check was buggered for > some reason. Need to dig up the gory details. > > And yes, I remember that we do not do an extra check for the fshared > case, because get_user_pages_fast() should do it for us already. If > not we are fubared not only in the futex code. > > But there is a subtle detail: > > err = get_user_pages_fast(address, 1, 1, &page); > > So we ask for write access as the write argument is 1. In case that > fails we have that fallback path: > > /* > * If write access is not required (eg. FUTEX_WAIT), try > * and get read-only access. > */ > if (err == -EFAULT && rw == VERIFY_READ) { > err = get_user_pages_fast(address, 1, 0, &page); > > That's a legitimate use case. And futex_requeue only requests > VERIFY_READ for the !requeue_pi case. > > Now, if that map is RO, i.e. we took the fallback path then the THP > one will fail as it has write=1 unconditionally. > > if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) > Is there a reason THP requires unconditional rw? Andrea? Or is the following actually the answer here? diff --git a/kernel/futex.c b/kernel/futex.c index 80ba086..02febad 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -288,7 +288,7 @@ again: put_page(page); /* serialize against __split_huge_page_splitting() */ local_irq_disable(); - if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) { + if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) { page_head = compound_head(page); /* * page_head is valid pointer but we must pin -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply related [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:42 ` process 'stuck' at exit Thomas Gleixner 2013-12-10 22:48 ` Linus Torvalds 2013-12-10 22:51 ` Darren Hart @ 2013-12-10 23:24 ` Al Viro 2013-12-11 16:31 ` Mel Gorman 3 siblings, 0 replies; 114+ messages in thread From: Al Viro @ 2013-12-10 23:24 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 11:42:15PM +0100, Thomas Gleixner wrote: > /* > * If write access is not required (eg. FUTEX_WAIT), try > * and get read-only access. > */ > if (err == -EFAULT && rw == VERIFY_READ) { > err = get_user_pages_fast(address, 1, 0, &page); > > That's a legitimate use case. And futex_requeue only requests > VERIFY_READ for the !requeue_pi case. > > Now, if that map is RO, i.e. we took the fallback path then the THP > one will fail as it has write=1 unconditionally. access_ok() has nothing whatsoever to do with RO vs. RW mappings. It checks whether the address is OK for userland on architectures with userland and kernel sharing the same address space (e.g. x86). On something like e.g. sparc64 or s390 it's constant 1. Note that there's nothing to stop another thread from remapping an RW area RO just as you've returned from access_ok(), so checking for writability in access_ok() would've been racy as hell. Ditto for address being mapped at all... Moreover, there are exactly two architectures that do not ignore the first argument of access_ok() - microblaze and um. The former uses it in debugging printk in failure case. The latter... AFAICS, it's pointless - it's a special dispensation for read access to host vsyscall page from guest process. The thing is, writes there are going to fail anyway - host kernel won't let the guest kernel to modify that page, period. IOW, it looks like um might as well drop the (type == VERIFY_READ) part in __access_ok_vsyscall(). Why do we have the 'type' argument of access_ok(), anyway? ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:42 ` process 'stuck' at exit Thomas Gleixner ` (2 preceding siblings ...) 2013-12-10 23:24 ` Al Viro @ 2013-12-11 16:31 ` Mel Gorman 2013-12-11 16:38 ` Thomas Gleixner 3 siblings, 1 reply; 114+ messages in thread From: Mel Gorman @ 2013-12-11 16:31 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra On Tue, Dec 10, 2013 at 11:42:15PM +0100, Thomas Gleixner wrote: > On Tue, 10 Dec 2013, Linus Torvalds wrote: > > > On Tue, Dec 10, 2013 at 1:57 PM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > So it looks like __get_user_pages_fast() fails, and keeps failing. > > > > Hmm.. Is any of the addresses unchecked, perhaps? > > __get_user_pages_fast() does an access_ok() check, while > > get_user_pages_fast() does *not* seem to do one. > > > > That looks a bit dangerous. Yeah, users should have checked the > > address range, but there really is no reason not to do it in > > get_user_pages_fast(). > > > > And it looks like the futex code is actually seriously buggered. It > > only does the access_ok() check for the non-shared case. > > > > Why? > > The !fshared case is the fast path which does not even reach > get_user_pages_fast(). > > We had this discussion some time ago already, where the access_ok() > check was missing in the !fshared case or the check was buggered for > some reason. Need to dig up the gory details. > > And yes, I remember that we do not do an extra check for the fshared > case, because get_user_pages_fast() should do it for us already. If > not we are fubared not only in the futex code. > > But there is a subtle detail: > > err = get_user_pages_fast(address, 1, 1, &page); > > So we ask for write access as the write argument is 1. In case that > fails we have that fallback path: > > /* > * If write access is not required (eg. FUTEX_WAIT), try > * and get read-only access. > */ > if (err == -EFAULT && rw == VERIFY_READ) { > err = get_user_pages_fast(address, 1, 0, &page); > > That's a legitimate use case. And futex_requeue only requests > VERIFY_READ for the !requeue_pi case. > > Now, if that map is RO, i.e. we took the fallback path then the THP > one will fail as it has write=1 unconditionally. > > if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) > Not that it really matters but the naming and comments around that particular __get_user_pages_fast call are a little misleading. The ifdef CONFIG_TRANSPARENT_HUGEPAGE in futex.c is there because greater care has to be taken against THP splits, not because it is dealing exclusively with THP. The PageTail check applies to either hugetlbfs or THPs and similarly gup_huge_pmd handles both. The whole path should be very rare for THPs considering that THPs exist on MAP_PRIVATE anonymous mappings and how many shared futexes exist backed by such mappings? A RO mapping makes that seem even more strange because what thread is updating the value the caller is waiting on? It seems more like that was a shared futex on a hugetlbfs-backed mappings which might explain why the bug was undiscovered for so long. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-11 16:31 ` Mel Gorman @ 2013-12-11 16:38 ` Thomas Gleixner 2013-12-11 17:57 ` Mel Gorman 0 siblings, 1 reply; 114+ messages in thread From: Thomas Gleixner @ 2013-12-11 16:38 UTC (permalink / raw) To: Mel Gorman Cc: Linus Torvalds, Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra On Wed, 11 Dec 2013, Mel Gorman wrote: > On Tue, Dec 10, 2013 at 11:42:15PM +0100, Thomas Gleixner wrote: > > Now, if that map is RO, i.e. we took the fallback path then the THP > > one will fail as it has write=1 unconditionally. > > > > if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) > > > > Not that it really matters but the naming and comments around that > particular __get_user_pages_fast call are a little misleading. The ifdef > CONFIG_TRANSPARENT_HUGEPAGE in futex.c is there because greater care has > to be taken against THP splits, not because it is dealing exclusively with > THP. The PageTail check applies to either hugetlbfs or THPs and similarly > gup_huge_pmd handles both. The whole path should be very rare for THPs > considering that THPs exist on MAP_PRIVATE anonymous mappings and how many > shared futexes exist backed by such mappings? A RO mapping makes that seem > even more strange because what thread is updating the value the caller is > waiting on? It seems more like that was a shared futex on a hugetlbfs-backed > mappings which might explain why the bug was undiscovered for so long. This is the fshared path. The MAP_PRIVATE path does not do that dance at all. Thanks, tglx ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-11 16:38 ` Thomas Gleixner @ 2013-12-11 17:57 ` Mel Gorman 0 siblings, 0 replies; 114+ messages in thread From: Mel Gorman @ 2013-12-11 17:57 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Dave Jones, Oleg Nesterov, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra On Wed, Dec 11, 2013 at 05:38:55PM +0100, Thomas Gleixner wrote: > On Wed, 11 Dec 2013, Mel Gorman wrote: > > On Tue, Dec 10, 2013 at 11:42:15PM +0100, Thomas Gleixner wrote: > > > Now, if that map is RO, i.e. we took the fallback path then the THP > > > one will fail as it has write=1 unconditionally. > > > > > > if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) > > > > > > > Not that it really matters but the naming and comments around that > > particular __get_user_pages_fast call are a little misleading. The ifdef > > CONFIG_TRANSPARENT_HUGEPAGE in futex.c is there because greater care has > > to be taken against THP splits, not because it is dealing exclusively with > > THP. The PageTail check applies to either hugetlbfs or THPs and similarly > > gup_huge_pmd handles both. The whole path should be very rare for THPs > > considering that THPs exist on MAP_PRIVATE anonymous mappings and how many > > shared futexes exist backed by such mappings? A RO mapping makes that seem > > even more strange because what thread is updating the value the caller is > > waiting on? It seems more like that was a shared futex on a hugetlbfs-backed > > mappings which might explain why the bug was undiscovered for so long. > > This is the fshared path. The MAP_PRIVATE path does not do that dance > at all. > do_futex takes an op parameter and sets FLAGS_SHARED on that op parameter, not whether the VMA backing that address was created with the MAP_PRIVATE or MAP_SHARED flag. For example; do_futex(addr, op) where !(ops & FUTEX_PRIVATE_FLAG) and cmd == FUTEX_WAIT -> futex_wait -> futex_wait_setup -> get_futex_key(fshared == true) THP pages encountered in the fshared path should be rare because why create a shared futex on a private mapping? That does not make it impossible which is why splits are guarded against. If it was genuinely impossible to be in this path for MAP_PRIVATE then there is no point worrying about THP as it is currently supported at least. Overall, it still seems more likely this was a hugetlbfs page. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:57 ` Linus Torvalds ` (2 preceding siblings ...) 2013-12-10 22:19 ` Linus Torvalds @ 2013-12-12 19:00 ` Andrea Arcangeli 2013-12-12 19:03 ` Linus Torvalds 3 siblings, 1 reply; 114+ messages in thread From: Andrea Arcangeli @ 2013-12-12 19:00 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Oleg Nesterov, Thomas Gleixner, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman Hello, On Tue, Dec 10, 2013 at 01:57:49PM -0800, Linus Torvalds wrote: > On Tue, Dec 10, 2013 at 1:41 PM, Dave Jones <davej@redhat.com> wrote: > > > > http://codemonkey.org.uk/junk/trace > > Hmm. Ok, so something is calling [__]get_user_pages_fast() and > put_page() in a loop, but the trace doesn't show what that "something" > is, because it is itself not ever called. > > However, that pattern does seem to imply that the loop is in > get_futex_key(), because all the other loops I see seem to be calling > other things as well. > > And the __get_user_pages_fast() call implies that it's the THP case > that triggers the "unlikely(PageTail(page))" case. And anyway, > otherwise we'd see lock_page()/unlock_page() too. > > So it looks like __get_user_pages_fast() fails, and keeps failing. > Andrea, this is your code, any ideas? Commit a5b338f2b0b1f ("thp: > update futex compound knowledge") to be exact. I can only agree that the __get_user_pages_fast "write" parameter should have been !ro instead of 1. However it wasn't me introducing the bug, my code when patched in early 2011 would work fine. The bug was introduced half a year later in commit 9ea71503a8ed9184d2d0b8ccc4d269d05f7940ae . @@ -262,8 +264,18 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key) again: err = get_user_pages_fast(address, 1, 1, &page); + /* + * If write access is not required (eg. FUTEX_WAIT), try + * and get read-only access. + */ + if (err == -EFAULT && rw == VERIFY_READ) { + err = get_user_pages_fast(address, 1, 0, &page); + ro = 1; + } if (err < 0) return err; + else + err = 0; #ifdef CONFIG_TRANSPARENT_HUGEPAGE page_head = page; @@ -305,6 +317,13 @@ again: if (!page_head->mapping) { unlock_page(page_head); put_page(page_head); + /* + * ZERO_PAGE pages don't have a mapping. Avoid a busy loop + * trying to find one. RW mapping would have COW'd (and thus + * have a mapping) so this page is RO and won't ever change. + */ + if ((page_head == ZERO_PAGE(address))) + return -EFAULT; goto again; } [..] This commit didn't update the __get_user_pages_fast to use !ro instead of 1, as it would have been required after the above change. I'll now review Oleg proposed cleanup. Thanks for tracking this down! Andrea ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-12 19:00 ` Andrea Arcangeli @ 2013-12-12 19:03 ` Linus Torvalds 0 siblings, 0 replies; 114+ messages in thread From: Linus Torvalds @ 2013-12-12 19:03 UTC (permalink / raw) To: Andrea Arcangeli Cc: Dave Jones, Oleg Nesterov, Thomas Gleixner, Darren Hart, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Thu, Dec 12, 2013 at 11:00 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > However it wasn't me introducing the bug, my code when patched in > early 2011 would work fine. The bug was introduced half a year later > in commit 9ea71503a8ed9184d2d0b8ccc4d269d05f7940ae . I'd argue that half a year later the bug was *fixed*, but it was only fixed for the regular pages. Leaving largepages buggy. Linus ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 21:41 ` Dave Jones 2013-12-10 21:57 ` Linus Torvalds @ 2013-12-10 22:09 ` Steven Rostedt 2013-12-10 22:16 ` Dave Jones 1 sibling, 1 reply; 114+ messages in thread From: Steven Rostedt @ 2013-12-10 22:09 UTC (permalink / raw) To: Dave Jones, Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 04:41:43PM -0500, Dave Jones wrote: > > > > OK, thanks. So it doesn't return to user-space. > > > > could you do > > > > cd /sys/kernel/debug/tracing/ > > echo 10818 >> set_ftrace_pid > > echo function_graph >> current_tracer > > echo 1 >> tracing_on > > > > and look into "trace" file to find out how exactly it loops? > > http://codemonkey.org.uk/junk/trace Because we are already in the function that is looping, we don't see what that function is (it's never called). So you can do either: trace-cmd record -p function -l get_user_pages_fast --func-stack sleep 5 Which will trace the get_user_pages_fast and spit out a full call trace. Or if you don't want to use trace-cmd, you can do it by hand. But be warned! If you don't do this right, you can live lock the system. Or make it extremely slow. That is, you must have a filter on the functions you trace before you set the function stack trace flag. (/me needs to prevent that from happening) cd /sys/kernel/debug/tracing echo get_user_pages_fast > set_ftrace_filter cat set_ftrace_filter # make sure get_user_pages_fast is there echo function > current_tracer echo 1 > options/func_stack_trace read your trace. Either by: cat trace or trace-cmd show And then after you recorded that. echo 0 > options/func_stack_trace to make sure you don't accidently enable stack tracing on *all* functions. I haven't had that really live lock the system, but it took about two minutes to disable it again, as each key stroke took several seconds to compete. -- Steve ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:09 ` Steven Rostedt @ 2013-12-10 22:16 ` Dave Jones 2013-12-10 22:21 ` Steven Rostedt 0 siblings, 1 reply; 114+ messages in thread From: Dave Jones @ 2013-12-10 22:16 UTC (permalink / raw) To: Steven Rostedt Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 05:09:46PM -0500, Steven Rostedt wrote: > So you can do either: > > trace-cmd record -p function -l get_user_pages_fast --func-stack sleep 5 > > Which will trace the get_user_pages_fast and spit out a full call trace. This gives me a 100M trace.dat, but trace show shows.. # tracer: nop # # entries-in-buffer/entries-written: 0/0 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | and that's it. > Or if you don't want to use trace-cmd, you can do it by hand. Given the state the system is in, I don't want to perturb it too much. Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:16 ` Dave Jones @ 2013-12-10 22:21 ` Steven Rostedt 2013-12-10 22:27 ` Dave Jones 0 siblings, 1 reply; 114+ messages in thread From: Steven Rostedt @ 2013-12-10 22:21 UTC (permalink / raw) To: Dave Jones, Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 05:16:21PM -0500, Dave Jones wrote: > 46.GA6962@home.goodmis.org> > User-Agent: Mutt/1.5.21 (2010-09-15) > X-Scanned-By: MIMEDefang 2.68 on 10.5.11.23 > Sender: linux-kernel-owner@vger.kernel.org > Precedence: bulk > List-ID: <linux-kernel.vger.kernel.org> > X-Mailing-List: linux-kernel@vger.kernel.org > X-UID: 421684 > Content-Length: 1147 > Status: O > > On Tue, Dec 10, 2013 at 05:09:46PM -0500, Steven Rostedt wrote: > > > So you can do either: > > > > trace-cmd record -p function -l get_user_pages_fast --func-stack sleep 5 > > > > Which will trace the get_user_pages_fast and spit out a full call trace. > This gives me a 100M trace.dat, but trace show shows.. > > # tracer: nop > # > # entries-in-buffer/entries-written: 0/0 #P:4 > # > # _-----=> irqs-off > # / _----=> need-resched > # | / _---=> hardirq/softirq > # || / _--=> preempt-depth > # ||| / delay > # TASK-PID CPU# |||| TIMESTAMP FUNCTION > # | | | |||| | | > > and that's it. Oh, if you use trace-cmd record. You need to do trace-cmd report. The trace-cmd show, shows you what's in the trace file, which was for the "manual" version. Sorry for the confusion. -- Steve > > > Or if you don't want to use trace-cmd, you can ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 22:21 ` Steven Rostedt @ 2013-12-10 22:27 ` Dave Jones 0 siblings, 0 replies; 114+ messages in thread From: Dave Jones @ 2013-12-10 22:27 UTC (permalink / raw) To: Steven Rostedt Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra, Mel Gorman On Tue, Dec 10, 2013 at 05:21:50PM -0500, Steven Rostedt wrote: > The trace-cmd show, shows you what's in the trace file, which was for the > "manual" version. > > Sorry for the confusion. ah, thanks. That shows.. version = 6 CPU 0 is empty CPU 2 is empty CPU 3 is empty cpus=4 trinity-child27-10818 [001] 89790.703544: kernel_stack: <stack trace> => futex_requeue (ffffffff810df18a) => do_futex (ffffffff810e019e) => SyS_futex (ffffffff810e0de1) => tracesys (ffffffff81760be4) trinity-child27-10818 [001] 89790.703545: function: get_user_pages_fast trinity-child27-10818 [001] 89790.703547: kernel_stack: <stack trace> => futex_requeue (ffffffff810df18a) => do_futex (ffffffff810e019e) => SyS_futex (ffffffff810e0de1) => tracesys (ffffffff81760be4) trinity-child27-10818 [001] 89790.703547: function: get_user_pages_fast trinity-child27-10818 [001] 89790.703549: kernel_stack: <stack trace> => futex_requeue (ffffffff810df18a) => do_futex (ffffffff810e019e) => SyS_futex (ffffffff810e0de1) => tracesys (ffffffff81760be4) trinity-child27-10818 [001] 89790.703550: function: get_user_pages_fast trinity-child27-10818 [001] 89790.703552: kernel_stack: <stack trace> => futex_requeue (ffffffff810df18a) => do_futex (ffffffff810e019e) => SyS_futex (ffffffff810e0de1) => tracesys (ffffffff81760be4) over and over and.. Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 19:18 ` Thomas Gleixner 2013-12-10 19:55 ` Linus Torvalds @ 2013-12-11 1:02 ` Mel Gorman 1 sibling, 0 replies; 114+ messages in thread From: Mel Gorman @ 2013-12-11 1:02 UTC (permalink / raw) To: Thomas Gleixner Cc: Linus Torvalds, Dave Jones, Darren Hart, Andrea Arcangeli, Linux Kernel Mailing List, Peter Zijlstra On Tue, Dec 10, 2013 at 08:18:29PM +0100, Thomas Gleixner wrote: > On Tue, 10 Dec 2013, Linus Torvalds wrote: > > > Hmm. Looks like the futex code is somehow stuck in a loop, calling > > get_user_pages_fast(). > > > > The futex code itself is apparently so low-overhead that it doesn't > > show up in your 'perf top' report (which is dominated by all the > > expensive debug things that get_user_pages_fast() etc ends up doing), > > but that's the only looping I can see. Perhaps the "goto again" case > > for transparent huge pages in get_futex_key()? Or the > > Cc'ng more folks on that. > I just saw this before heading to bed and have not read the thread. I'll read it in the morning but in the meantime the following might ring a bell for someone elses investigation or someone more familiar with how futexs work from end to end. Was NUMA balancing enabled and was this a NUMA machine? I ask because of these two patches that are currently in flight mm: numa: Serialise parallel get_user_page against THP migration mm fix TLB flush race between migration, and change_protection_range There are related patches but these two are the most important for what I have in mind. The two in combination address a problem whereby a write from one thread can be lost due to a THP migration but it's specific to automatic NUMA balancing. If the lost update was for a page containing a futex then the lost write could confuse waiters. The downside is that this is a bad fit for the problem description in the first mail. A lost update might result in processes waiting forever on a value that never changes but offhand it's less clear why it might result in a loop. Unless of course there is a combination of events that allows for a busy wait on a value that will never change due to the lost write. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 18:40 ` Linus Torvalds 2013-12-10 19:18 ` Thomas Gleixner @ 2013-12-10 20:57 ` Darren Hart 2013-12-10 21:09 ` Dave Jones 1 sibling, 1 reply; 114+ messages in thread From: Darren Hart @ 2013-12-10 20:57 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List On Tue, 2013-12-10 at 10:40 -0800, Linus Torvalds wrote: > Hmm. Looks like the futex code is somehow stuck in a loop, calling > get_user_pages_fast(). > > The futex code itself is apparently so low-overhead that it doesn't > show up in your 'perf top' report (which is dominated by all the > expensive debug things that get_user_pages_fast() etc ends up doing), > but that's the only looping I can see. Perhaps the "goto again" case > for transparent huge pages in get_futex_key()? Or the > "retry[_private]" cases in futex_requeue()? Some error condition that > causes us to retry forever, rather than returning the error code? > > Added a few more people to the cc.. Ideas? > > Linus > > > On Tue, Dec 10, 2013 at 7:47 AM, Dave Jones <davej@redhat.com> wrote: > > I woke up to find my fuzzer in a curious state. > > > > 1121 pts/5 SN+ 0:00 | \_ ../trinity -q -l off -N 999999 -C 42 > > 1130 pts/5 SN+ 0:01 | \_ ../trinity -q -l off -N 999999 -C 42 > > 1131 pts/5 SN+ 0:17 | \_ ../trinity -q -l off -N 999999 -C 42 > > 10818 ? RNs 21115152:53 | \_ ../trinity -q -l off -N 999999 -C 42 > > > > (ignore the first 3 pids, they're waiting on 10818, which is the interesting one) > > > > It's completed its run of 999999 syscalls, and looking at tmux, it tried to exit. > > > > /proc/10818/stack just shows > > > > [<ffffffffffffffff>] 0xffffffffffffffff > > > > Top reports a core is spinning in the kernel 100%, so I ran perf top -a > > and saw.. > > > > 8.63% [kernel] [k] trace_hardirqs_off_caller > > 7.68% [kernel] [k] __lock_acquire > > 5.35% [kernel] [k] gup_huge_pmd > > 5.31% [kernel] [k] put_compound_page > > 4.93% [kernel] [k] gup_pud_range > > 4.76% [kernel] [k] trace_hardirqs_on_caller > > 4.58% [kernel] [k] get_user_pages_fast > > 4.00% [kernel] [k] debug_smp_processor_id > > 4.00% [kernel] [k] lock_is_held > > 3.73% [kernel] [k] lock_acquired > > 3.67% [kernel] [k] lock_release > > > > > > sysrq-t shows.. > > > > trinity-child27 R running task 5520 10818 1131 0x00080004 > > 0000000000000000 ffff8801b0ef4170 000000000000032c ffff8801b609e108 > > 0000000000000000 ffff880160d21c30 ffffffff810ad895 ffffffff817587a0 > > ffff8801b0ef4170 ffff8801b609e0a8 ffff8801b609e000 ffff880160d21d50 > > Call Trace: > > [<ffffffff817587a0>] ? retint_restore_args+0xe/0xe > > [<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f > > [<ffffffff8100b184>] ? native_sched_clock+0x24/0x80 > > [<ffffffff8109624f>] ? local_clock+0xf/0x50 > > [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30 > > [<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190 > > [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0 > > [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0 > > [<ffffffff810a8a2f>] ? up_read+0x1f/0x40 > > [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0 > > [<ffffffff8115f76c>] ? put_page+0x3c/0x50 > > [<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0 > > [<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0 > > [<ffffffff810e019e>] ? do_futex+0xae/0xc80 > > [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30 > > [<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170 > > [<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190 > > [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0 > > [<ffffffff810e0de1>] ? SyS_futex+0x71/0x150 > > [<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0 > > [<ffffffff81760be4>] ? tracesys+0xdd/0xe2 > > Hi Dave, Can you get us an idea of the arguments trinity is tossing into SYS_futex? Op code? Would help to know if this was requeue_pi for example. Type of memory being used for the uaddr? I see futex_requeue in the stack, which means the opcode is one of: FUTEX_REQUEUE FUTEX_CMP_REQUEUE FUTEX_CMP_REQUEUE_PI FUTEX_REQUEUE has a known issue and was replaced with FUTEX_CMP_REQUEUE, for details, test cases, and an analysis see the historic tree: commit 9b91d73bde9d68800f9e5c338c0cf9d0fe3bc862 Author: Andrew Morton <akpm@osdl.org> Date: 2004-05-31 [PATCH] Add FUTEX_CMP_REQUEUE futex op Specifically: http://listman.redhat.com/archives/phil-list/2004-May/msg00023.html Trinity is going to trigger hangs in futexes just by it's very nature, but I believe you have watchdogs in place to kill such malformed tests after a timeout? I'll keep digging. -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: process 'stuck' at exit. 2013-12-10 20:57 ` Darren Hart @ 2013-12-10 21:09 ` Dave Jones 0 siblings, 0 replies; 114+ messages in thread From: Dave Jones @ 2013-12-10 21:09 UTC (permalink / raw) To: Darren Hart Cc: Linus Torvalds, Thomas Gleixner, Andrea Arcangeli, Linux Kernel Mailing List On Tue, Dec 10, 2013 at 12:57:57PM -0800, Darren Hart wrote: > > > Call Trace: > > > [<ffffffff817587a0>] ? retint_restore_args+0xe/0xe > > > [<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f > > > [<ffffffff8100b184>] ? native_sched_clock+0x24/0x80 > > > [<ffffffff8109624f>] ? local_clock+0xf/0x50 > > > [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30 > > > [<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190 > > > [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0 > > > [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0 > > > [<ffffffff810a8a2f>] ? up_read+0x1f/0x40 > > > [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0 > > > [<ffffffff8115f76c>] ? put_page+0x3c/0x50 > > > [<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0 > > > [<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0 > > > [<ffffffff810e019e>] ? do_futex+0xae/0xc80 > > > [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30 > > > [<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170 > > > [<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190 > > > [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0 > > > [<ffffffff810e0de1>] ? SyS_futex+0x71/0x150 > > > [<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0 > > > [<ffffffff81760be4>] ? tracesys+0xdd/0xe2 > > > > > Can you get us an idea of the arguments trinity is tossing into > SYS_futex? > > Op code? Would help to know if this was requeue_pi for example. > Type of memory being used for the uaddr? As is always the case, the interesting bugs only seem to happen when I have logging disabled. So other than what I can glean from what's left in the shm, no idea. One of the other child processes (which exited already) did do a sys_futex. the params it passed were.. 1cb5000, -1, c57, 1cb5004, ffffffffffd8f420, 90000000091a6311 The result of this syscall was -1 > I see futex_requeue in the stack, which means the opcode is one of: > > FUTEX_REQUEUE > FUTEX_CMP_REQUEUE > FUTEX_CMP_REQUEUE_PI > > FUTEX_REQUEUE has a known issue and was replaced with FUTEX_CMP_REQUEUE, > for details, test cases, and an analysis see the historic tree: > > commit 9b91d73bde9d68800f9e5c338c0cf9d0fe3bc862 > Author: Andrew Morton <akpm@osdl.org> > Date: 2004-05-31 > > [PATCH] Add FUTEX_CMP_REQUEUE futex op > > Specifically: > http://listman.redhat.com/archives/phil-list/2004-May/msg00023.html > > > Trinity is going to trigger hangs in futexes just by it's very nature, > but I believe you have watchdogs in place to kill such malformed tests > after a timeout? It should. Though that pid is happily ignoring the SIGKILL's the watchdog is continuing to send, because it's never getting around to processing the signals apparently. Dave ^ permalink raw reply [flat|nested] 114+ messages in thread
end of thread, other threads:[~2014-01-10 16:50 UTC | newest] Thread overview: 114+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-12-10 15:47 process 'stuck' at exit Dave Jones 2013-12-10 18:40 ` Linus Torvalds 2013-12-10 19:18 ` Thomas Gleixner 2013-12-10 19:55 ` Linus Torvalds 2013-12-10 20:27 ` Dave Jones 2013-12-10 20:34 ` Thomas Gleixner 2013-12-10 20:55 ` Dave Jones 2013-12-10 21:25 ` Darren Hart 2013-12-10 21:28 ` Thomas Gleixner 2013-12-10 21:39 ` Steven Rostedt 2013-12-10 20:33 ` Thomas Gleixner 2013-12-10 20:43 ` Linus Torvalds 2013-12-10 21:17 ` Thomas Gleixner 2013-12-10 20:35 ` Oleg Nesterov 2013-12-10 20:49 ` Dave Jones 2013-12-10 21:06 ` Darren Hart 2013-12-10 21:12 ` Dave Jones 2013-12-10 21:18 ` Linus Torvalds 2013-12-10 21:24 ` Linus Torvalds 2013-12-10 21:32 ` Dave Jones 2013-12-10 21:49 ` Linus Torvalds 2013-12-10 21:56 ` Dave Jones 2013-12-10 21:59 ` Linus Torvalds 2013-12-10 22:07 ` Dave Jones 2013-12-11 12:45 ` Ingo Molnar 2013-12-10 21:34 ` Oleg Nesterov 2013-12-10 21:41 ` Dave Jones 2013-12-10 21:57 ` Linus Torvalds 2013-12-10 22:02 ` Dave Jones 2013-12-10 22:09 ` Oleg Nesterov 2013-12-10 22:19 ` Linus Torvalds 2013-12-10 22:33 ` Linus Torvalds 2013-12-10 22:38 ` Darren Hart 2013-12-10 22:57 ` Thomas Gleixner 2013-12-10 23:05 ` Linus Torvalds 2013-12-10 23:28 ` Thomas Gleixner 2013-12-10 23:31 ` Al Viro 2013-12-11 17:08 ` Oleg Nesterov 2013-12-11 17:18 ` Thomas Gleixner 2013-12-11 17:56 ` Oleg Nesterov 2013-12-11 19:18 ` PATCH? introduce get_compound_page (Was: process 'stuck' at exit) Oleg Nesterov 2013-12-13 15:10 ` Andrea Arcangeli 2013-12-13 16:22 ` Oleg Nesterov 2013-12-13 17:34 ` Andrea Arcangeli 2013-12-16 18:36 ` Oleg Nesterov 2013-12-16 20:19 ` Andrea Arcangeli 2013-12-16 20:46 ` Oleg Nesterov 2013-12-17 16:53 ` Oleg Nesterov 2013-12-17 18:06 ` Andrea Arcangeli 2013-12-18 19:19 ` [PATCH -mm 0/7] (Was: introduce get_compound_page) Oleg Nesterov 2013-12-18 19:19 ` [PATCH -mm 1/7] mm: thp: introduce __put_nontail_page() Oleg Nesterov 2013-12-18 19:19 ` [PATCH -mm 2/7] mm: thp: change __get_page_tail() to use __put_nontail_page() Oleg Nesterov 2013-12-18 19:19 ` [PATCH -mm 3/7] mm: change release_pages() to use put_page() rather than put_compound_page() Oleg Nesterov 2013-12-18 19:19 ` [PATCH -mm 4/7] mm: thp: turn put_compound_page() into __put_page_tail() Oleg Nesterov 2013-12-18 19:36 ` Peter Zijlstra 2013-12-18 19:50 ` Oleg Nesterov 2013-12-18 19:20 ` [PATCH -mm 5/7] mm: thp: reorganize out_put_single code in __put_page_tail() Oleg Nesterov 2013-12-18 19:20 ` [PATCH -mm 6/7] mm: thp: introduce get_lock_thp_head() Oleg Nesterov 2013-12-18 21:37 ` Linus Torvalds 2013-12-19 16:29 ` Oleg Nesterov 2013-12-18 19:20 ` [PATCH -mm 7/7] mm: thp: introduce compound_head_put_tail(), change get_futex_key() to use it Oleg Nesterov 2013-12-18 19:28 ` Peter Zijlstra 2013-12-18 19:40 ` Oleg Nesterov 2013-12-19 19:08 ` [PATCH 0/1] mm: fix the theoretical compound_lock() vs prep_new_page() race Oleg Nesterov 2013-12-19 19:09 ` [PATCH 1/1] " Oleg Nesterov 2013-12-23 11:43 ` Andrea Arcangeli 2014-01-03 19:55 ` [PATCH v2 0/1] " Oleg Nesterov 2014-01-03 19:55 ` [PATCH v2 1/1] " Oleg Nesterov 2014-01-03 21:00 ` Andrew Morton 2014-01-04 16:43 ` Oleg Nesterov 2014-01-08 11:54 ` Mel Gorman 2014-01-08 13:14 ` Mel Gorman 2014-01-08 16:13 ` Oleg Nesterov 2014-01-08 18:02 ` Mel Gorman 2014-01-08 19:04 ` Oleg Nesterov 2014-01-09 11:27 ` Mel Gorman 2014-01-09 14:04 ` Oleg Nesterov 2014-01-09 18:52 ` Andrea Arcangeli 2014-01-09 19:43 ` Oleg Nesterov 2014-01-09 21:36 ` Andrea Arcangeli 2014-01-10 16:12 ` Oleg Nesterov 2014-01-10 16:50 ` Peter Zijlstra 2014-01-10 16:12 ` Mel Gorman 2013-12-20 14:19 ` [PATCH 0/1] " Martin Schwidefsky 2013-12-16 20:19 ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Oleg Nesterov 2013-12-16 20:19 ` [PATCH -mm 1/2] mm: thp: __get_page_tail_foll() can use get_huge_page_tail() Oleg Nesterov 2013-12-16 20:19 ` [PATCH -mm 2/2] mm: thp: turn compound_head() into BUG_ON(!PageTail) in get_huge_page_tail() Oleg Nesterov 2013-12-16 20:27 ` [PATCH 0/2] mm: thp: get_huge_page_tail() cleanups Andrea Arcangeli 2013-12-10 22:42 ` process 'stuck' at exit Thomas Gleixner 2013-12-10 22:48 ` Linus Torvalds 2013-12-10 22:58 ` Darren Hart 2013-12-10 23:01 ` Dave Jones 2013-12-10 23:00 ` Dave Jones 2013-12-11 0:05 ` Steven Rostedt 2013-12-11 0:23 ` Dave Jones 2013-12-11 0:55 ` Dave Jones 2013-12-14 20:17 ` Oleg Nesterov 2013-12-11 4:09 ` Dave Jones 2013-12-12 4:26 ` Dave Jones 2013-12-12 5:29 ` Darren Hart 2013-12-10 22:51 ` Darren Hart 2013-12-10 23:24 ` Al Viro 2013-12-11 16:31 ` Mel Gorman 2013-12-11 16:38 ` Thomas Gleixner 2013-12-11 17:57 ` Mel Gorman 2013-12-12 19:00 ` Andrea Arcangeli 2013-12-12 19:03 ` Linus Torvalds 2013-12-10 22:09 ` Steven Rostedt 2013-12-10 22:16 ` Dave Jones 2013-12-10 22:21 ` Steven Rostedt 2013-12-10 22:27 ` Dave Jones 2013-12-11 1:02 ` Mel Gorman 2013-12-10 20:57 ` Darren Hart 2013-12-10 21:09 ` Dave Jones
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).