From: vladimir.murzin@arm.com (Vladimir Murzin)
To: linux-arm-kernel@lists.infradead.org
Subject: [BISECTED] rcu_sched self-detected stall since 3.17
Date: Tue, 01 Dec 2015 11:50:17 +0000 [thread overview]
Message-ID: <565D8979.5050605@arm.com> (raw)
In-Reply-To: <564F3DCA.1080907@arm.com>
On 20/11/15 15:35, Vladimir Murzin wrote:
> Hi,
>
> I've been getting rcu_sched self-detected stall on one of our test
> environment with the following (or similar) log:
>
> Linux 4.3
Anybody looking into this? Oleg, Peter, any ideas what's happening here?
Thanks
Vladimir
>
>> INFO: rcu_sched self-detected stall on CPU
>> 2: (2099 ticks this GP) idle=a73/140000000000001/0 softirq=277/283 fqs=2092
>> (t=2100 jiffies g=-197 c=-198 q=230)
>> Task dump for CPU 2:
>> paranoia R running 0 423 1 0x00000003
>> [<c0016f20>] (unwind_backtrace) from [<c0012fd0>] (show_stack+0x10/0x14)
>> [<c0012fd0>] (show_stack) from [<c006c52c>] (rcu_dump_cpu_stacks+0xa4/0xc4)
>> [<c006c52c>] (rcu_dump_cpu_stacks) from [<c006f188>] (rcu_check_callbacks+0x2dc/0x81c)
>> [<c006f188>] (rcu_check_callbacks) from [<c0072278>] (update_process_times+0x38/0x64)
>> [<c0072278>] (update_process_times) from [<c007e7b8>] (tick_periodic+0xa8/0xb8)
>> [<c007e7b8>] (tick_periodic) from [<c007e904>] (tick_handle_periodic+0x28/0x88)
>> [<c007e904>] (tick_handle_periodic) from [<c0223074>] (arch_timer_handler_virt+0x28/0x30)
>> [<c0223074>] (arch_timer_handler_virt) from [<c0068008>] (handle_percpu_devid_irq+0x6c/0x84)
>> [<c0068008>] (handle_percpu_devid_irq) from [<c0064244>] (generic_handle_irq+0x24/0x34)
>> [<c0064244>] (generic_handle_irq) from [<c0064518>] (__handle_domain_irq+0x98/0xac)
>> [<c0064518>] (__handle_domain_irq) from [<c0009444>] (gic_handle_irq+0x54/0x90)
>> [<c0009444>] (gic_handle_irq) from [<c0013b14>] (__irq_svc+0x54/0x70)
>> Exception stack(0xdb007c60 to 0xdb007ca8)
>> 7c60: 00000000 00000000 00000000 00000001 dc3e9a20 00000000 00000000 00000001
>> 7c80: dc3e9a20 0000001b 00080000 0000000b ffffffff db007cb0 c00ac8f0 c00aa96c
>> 7ca0: 600d0113 ffffffff
>> [<c0013b14>] (__irq_svc) from [<c00aa96c>] (free_pages_prepare+0x18c/0x218)
>> [<c00aa96c>] (free_pages_prepare) from [<c00ac8f0>] (free_hot_cold_page+0x34/0x168)
>> [<c00ac8f0>] (free_hot_cold_page) from [<c00c996c>] (handle_mm_fault+0x7b8/0xe60)
>> [<c00c996c>] (handle_mm_fault) from [<c001c194>] (do_page_fault+0x12c/0x378)
>> [<c001c194>] (do_page_fault) from [<c00092dc>] (do_DataAbort+0x38/0xb4)
>> [<c00092dc>] (do_DataAbort) from [<c0013aa0>] (__dabt_svc+0x40/0x60)
>> Exception stack(0xdb007e68 to 0xdb007eb0)
>> 7e60: 0001b2c8 00000d30 00000000 00000055 00000051 dbb04600
>> 7e80: 00000000 db007ec8 db8ea700 db8f3080 dba43240 0001b2c8 00000000 db007eb8
>> 7ea0: c012c178 c0174ce8 200d0113 ffffffff
>> [<c0013aa0>] (__dabt_svc) from [<c0174ce8>] (__clear_user_std+0x34/0x68)
>> [<c0174ce8>] (__clear_user_std) from [<c012c178>] (padzero+0x58/0x74)
>> [<c012c178>] (padzero) from [<c012cb4c>] (load_elf_binary+0x730/0x1250)
>> [<c012cb4c>] (load_elf_binary) from [<c00ecd6c>] (search_binary_handler+0xa0/0x23c)
>> [<c00ecd6c>] (search_binary_handler) from [<c00ed730>] (do_execveat_common+0x444/0x5d8)
>> [<c00ed730>] (do_execveat_common) from [<c00ed8e8>] (do_execve+0x24/0x2c)
>> [<c00ed8e8>] (do_execve) from [<c000f580>] (ret_fast_syscall+0x0/0x3c)
>
> (I put report for Linux 3.17 under [2])
>
> This rcu_sched self-detected stall is usually reproduced 2 or 3 times
> out of 10 runs, but sometimes even more runs go without issue.
>
> I bisected [1] it to commit
>
> commit 743162013d40ca612b4cb53d3a200dff2d9ab26e
> Author: NeilBrown <neilb@suse.de>
> Date: Mon Jul 7 15:16:04 2014 +1000
>
> sched: Remove proliferation of wait_on_bit() action functions
>
> The only change I noticed is from (mm/filemap.c)
>
> io_schedule();
> fatal_signal_pending(current)
>
> to (kernel/sched/wait.c)
>
> signal_pending_state(current->state, current)
> io_schedule();
>
> and if I apply following diff I don't see stalls anymore.
>
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index a104879..2d68cdb 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -514,9 +514,10 @@ EXPORT_SYMBOL(bit_wait);
>
> __sched int bit_wait_io(void *word)
> {
> + io_schedule();
> +
> if (signal_pending_state(current->state, current))
> return 1;
> - io_schedule();
> return 0;
> }
> EXPORT_SYMBOL(bit_wait_io);
>
> Any ideas why it might happen and why diff above helps?
>
> [1] Bisect log:
>
> I checked 4.[0-3] and 3.1[7-9] manually
> I have https://lkml.org/lkml/2015/8/26/649 for better report
>
>> git bisect start '--no-checkout'
>> # good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
>> git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
>> # bad: [53ee983378ff23e8f3ff95ecf99dea7c6c221900] Merge tag 'staging-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
>> git bisect bad 53ee983378ff23e8f3ff95ecf99dea7c6c221900
>> # good: [2042088cd67d0064d18c52c13c69af2499907bb1] staging: comedi: ni_labpc: tidy up labpc_ai_scan_mode()
>> git bisect good 2042088cd67d0064d18c52c13c69af2499907bb1
>> # bad: [98959948a7ba33cf8c708626e0d2a1456397e1c6] Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> git bisect bad 98959948a7ba33cf8c708626e0d2a1456397e1c6
>> # good: [c9b88e9581828bb8bba06c5e7ee8ed1761172b6e] Merge tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
>> git bisect good c9b88e9581828bb8bba06c5e7ee8ed1761172b6e
>> # good: [5bda4f638f36ef4c4e3b1397b02affc3db94356e] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> git bisect good 5bda4f638f36ef4c4e3b1397b02affc3db94356e
>> # good: [2336ebc32676df5b794acfe0c980583ec6c05f34] Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf into perf/core
>> git bisect good 2336ebc32676df5b794acfe0c980583ec6c05f34
>> # good: [ef35ad26f8ff44d2c93e29952cdb336bda729d9d] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> git bisect good ef35ad26f8ff44d2c93e29952cdb336bda729d9d
>> # good: [1c5d3eb3759013bc7ee4197aa0a9f245bdb6eb90] sched/numa: Simplify task_numa_compare()
>> git bisect good 1c5d3eb3759013bc7ee4197aa0a9f245bdb6eb90
>> # good: [6e76ea8a8209386c3cc7ee5594e6ea5d25525cf2] sched: Remove extra static_key*() function indirection
>> git bisect good 6e76ea8a8209386c3cc7ee5594e6ea5d25525cf2
>> # bad: [c1221321b7c25b53204447cff9949a6d5a7ddddc] sched: Allow wait_on_bit_action() functions to support a timeout
>> git bisect bad c1221321b7c25b53204447cff9949a6d5a7ddddc
>> # good: [e720fff6341fe4b95e5a93c939bd3c77fa55ced4] sched/numa: Revert "Use effective_load() to balance NUMA loads"
>> git bisect good e720fff6341fe4b95e5a93c939bd3c77fa55ced4
>> # bad: [743162013d40ca612b4cb53d3a200dff2d9ab26e] sched: Remove proliferation of wait_on_bit() action functions
>> git bisect bad 743162013d40ca612b4cb53d3a200dff2d9ab26e
>> # good: [d26fad5b38e1c4667d4f2604936e59c837caa54d] Merge tag 'v3.16-rc5' into sched/core, to refresh the branch before applying bigger tree-wide changes
>> git bisect good d26fad5b38e1c4667d4f2604936e59c837caa54d
>
> [2] rcu_sched self-detected stall under Linux 3.17
>
>> INFO: rcu_sched self-detected stall on CPU { 2} (t=2100 jiffies g=-199 c=-200 q=172)
>> Task dump for CPU 2:
>> paranoia R running 0 414 1 0x00000003
>> [<c0015394>] (unwind_backtrace) from [<c00117dc>] (show_stack+0x10/0x14)
>> [<c00117dc>] (show_stack) from [<c00602d0>] (rcu_dump_cpu_stacks+0xa4/0xc4)
>> [<c00602d0>] (rcu_dump_cpu_stacks) from [<c0062aec>] (rcu_check_callbacks+0x280/0x6f8)
>> [<c0062aec>] (rcu_check_callbacks) from [<c0065b9c>] (update_process_times+0x40/0x60)
>> [<c0065b9c>] (update_process_times) from [<c007169c>] (tick_periodic+0xa8/0xb8)
>> [<c007169c>] (tick_periodic) from [<c00717bc>] (tick_handle_periodic+0x28/0x88)
>> [<c00717bc>] (tick_handle_periodic) from [<c01fc414>] (arch_timer_handler_virt+0x28/0x30)
>> [<c01fc414>] (arch_timer_handler_virt) from [<c005d100>] (handle_percpu_devid_irq+0x6c/0x84)
>> [<c005d100>] (handle_percpu_devid_irq) from [<c0059658>] (generic_handle_irq+0x2c/0x3c)
>> [<c0059658>] (generic_handle_irq) from [<c000ee3c>] (handle_IRQ+0x7c/0x8c)
>> [<c000ee3c>] (handle_IRQ) from [<c00085f8>] (gic_handle_irq+0x40/0x5c)
>> [<c00085f8>] (gic_handle_irq) from [<c0012300>] (__irq_svc+0x40/0x54)
>> Exception stack(0xdb063c60 to 0xdb063ca8)
>> 3c60: 00000000 0001b000 dbbfc000 0000000b dbbfc000 0000000b dbbfc000 0001b000
>> 3c80: 0001b000 dbb2d6e0 0000001b db03c380 0000006c db063ca8 c00b935c c00b935c
>> 3ca0: a00d0113 ffffffff
>> [<c0012300>] (__irq_svc) from [<c00b935c>] (do_cow_fault.isra.97+0x1c/0x198)
>> [<c00b935c>] (do_cow_fault.isra.97) from [<c00b9804>] (handle_mm_fault+0x14c/0x900)
>> [<c00b9804>] (handle_mm_fault) from [<c001ae38>] (do_page_fault+0x120/0x3d8)
>> [<c001ae38>] (do_page_fault) from [<c00084c0>] (do_DataAbort+0x38/0x98)
>> [<c00084c0>] (do_DataAbort) from [<c0012298>] (__dabt_svc+0x38/0x60)
>> Exception stack(0xdb063e70 to 0xdb063eb8)
>> 3e60: 0001b2c8 00000d30 00000000 00000000
>> 3e80: db902a80 dbadc100 00000000 db063ec8 db062008 dbadc200 0001b2c8 0001b6cc
>> 3ea0: 00000000 db063eb8 c011689c c015a884 200d0013 ffffffff
>> [<c0012298>] (__dabt_svc) from [<c015a884>] (__clear_user_std+0x34/0x64)
>> [<c015a884>] (__clear_user_std) from [<c011689c>] (padzero+0x44/0x58)
>> [<c011689c>] (padzero) from [<c01182e8>] (load_elf_binary+0x778/0x138c)
>> [<c01182e8>] (load_elf_binary) from [<c00da960>] (search_binary_handler+0x98/0x1d8)
>> [<c00da960>] (search_binary_handler) from [<c00dbc2c>] (do_execve+0x35c/0x4bc)
>> [<c00dbc2c>] (do_execve) from [<c000e520>] (ret_fast_syscall+0x0/0x30)
>
> Thanks
> Vladimir
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo at kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email at kvack.org </a>
>
>
>
WARNING: multiple messages have this Message-ID (diff)
From: Vladimir Murzin <vladimir.murzin@arm.com>
To: linux-kernel@vger.kernel.org
Cc: neilb@suse.de, oleg@redhat.com, peterz@infradead.org,
mark.rutland@arm.com, linux-arm-kernel@lists.infradead.org,
linux-mm@kvack.org
Subject: Re: [BISECTED] rcu_sched self-detected stall since 3.17
Date: Tue, 01 Dec 2015 11:50:17 +0000 [thread overview]
Message-ID: <565D8979.5050605@arm.com> (raw)
In-Reply-To: <564F3DCA.1080907@arm.com>
On 20/11/15 15:35, Vladimir Murzin wrote:
> Hi,
>
> I've been getting rcu_sched self-detected stall on one of our test
> environment with the following (or similar) log:
>
> Linux 4.3
Anybody looking into this? Oleg, Peter, any ideas what's happening here?
Thanks
Vladimir
>
>> INFO: rcu_sched self-detected stall on CPU
>> 2: (2099 ticks this GP) idle=a73/140000000000001/0 softirq=277/283 fqs=2092
>> (t=2100 jiffies g=-197 c=-198 q=230)
>> Task dump for CPU 2:
>> paranoia R running 0 423 1 0x00000003
>> [<c0016f20>] (unwind_backtrace) from [<c0012fd0>] (show_stack+0x10/0x14)
>> [<c0012fd0>] (show_stack) from [<c006c52c>] (rcu_dump_cpu_stacks+0xa4/0xc4)
>> [<c006c52c>] (rcu_dump_cpu_stacks) from [<c006f188>] (rcu_check_callbacks+0x2dc/0x81c)
>> [<c006f188>] (rcu_check_callbacks) from [<c0072278>] (update_process_times+0x38/0x64)
>> [<c0072278>] (update_process_times) from [<c007e7b8>] (tick_periodic+0xa8/0xb8)
>> [<c007e7b8>] (tick_periodic) from [<c007e904>] (tick_handle_periodic+0x28/0x88)
>> [<c007e904>] (tick_handle_periodic) from [<c0223074>] (arch_timer_handler_virt+0x28/0x30)
>> [<c0223074>] (arch_timer_handler_virt) from [<c0068008>] (handle_percpu_devid_irq+0x6c/0x84)
>> [<c0068008>] (handle_percpu_devid_irq) from [<c0064244>] (generic_handle_irq+0x24/0x34)
>> [<c0064244>] (generic_handle_irq) from [<c0064518>] (__handle_domain_irq+0x98/0xac)
>> [<c0064518>] (__handle_domain_irq) from [<c0009444>] (gic_handle_irq+0x54/0x90)
>> [<c0009444>] (gic_handle_irq) from [<c0013b14>] (__irq_svc+0x54/0x70)
>> Exception stack(0xdb007c60 to 0xdb007ca8)
>> 7c60: 00000000 00000000 00000000 00000001 dc3e9a20 00000000 00000000 00000001
>> 7c80: dc3e9a20 0000001b 00080000 0000000b ffffffff db007cb0 c00ac8f0 c00aa96c
>> 7ca0: 600d0113 ffffffff
>> [<c0013b14>] (__irq_svc) from [<c00aa96c>] (free_pages_prepare+0x18c/0x218)
>> [<c00aa96c>] (free_pages_prepare) from [<c00ac8f0>] (free_hot_cold_page+0x34/0x168)
>> [<c00ac8f0>] (free_hot_cold_page) from [<c00c996c>] (handle_mm_fault+0x7b8/0xe60)
>> [<c00c996c>] (handle_mm_fault) from [<c001c194>] (do_page_fault+0x12c/0x378)
>> [<c001c194>] (do_page_fault) from [<c00092dc>] (do_DataAbort+0x38/0xb4)
>> [<c00092dc>] (do_DataAbort) from [<c0013aa0>] (__dabt_svc+0x40/0x60)
>> Exception stack(0xdb007e68 to 0xdb007eb0)
>> 7e60: 0001b2c8 00000d30 00000000 00000055 00000051 dbb04600
>> 7e80: 00000000 db007ec8 db8ea700 db8f3080 dba43240 0001b2c8 00000000 db007eb8
>> 7ea0: c012c178 c0174ce8 200d0113 ffffffff
>> [<c0013aa0>] (__dabt_svc) from [<c0174ce8>] (__clear_user_std+0x34/0x68)
>> [<c0174ce8>] (__clear_user_std) from [<c012c178>] (padzero+0x58/0x74)
>> [<c012c178>] (padzero) from [<c012cb4c>] (load_elf_binary+0x730/0x1250)
>> [<c012cb4c>] (load_elf_binary) from [<c00ecd6c>] (search_binary_handler+0xa0/0x23c)
>> [<c00ecd6c>] (search_binary_handler) from [<c00ed730>] (do_execveat_common+0x444/0x5d8)
>> [<c00ed730>] (do_execveat_common) from [<c00ed8e8>] (do_execve+0x24/0x2c)
>> [<c00ed8e8>] (do_execve) from [<c000f580>] (ret_fast_syscall+0x0/0x3c)
>
> (I put report for Linux 3.17 under [2])
>
> This rcu_sched self-detected stall is usually reproduced 2 or 3 times
> out of 10 runs, but sometimes even more runs go without issue.
>
> I bisected [1] it to commit
>
> commit 743162013d40ca612b4cb53d3a200dff2d9ab26e
> Author: NeilBrown <neilb@suse.de>
> Date: Mon Jul 7 15:16:04 2014 +1000
>
> sched: Remove proliferation of wait_on_bit() action functions
>
> The only change I noticed is from (mm/filemap.c)
>
> io_schedule();
> fatal_signal_pending(current)
>
> to (kernel/sched/wait.c)
>
> signal_pending_state(current->state, current)
> io_schedule();
>
> and if I apply following diff I don't see stalls anymore.
>
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index a104879..2d68cdb 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -514,9 +514,10 @@ EXPORT_SYMBOL(bit_wait);
>
> __sched int bit_wait_io(void *word)
> {
> + io_schedule();
> +
> if (signal_pending_state(current->state, current))
> return 1;
> - io_schedule();
> return 0;
> }
> EXPORT_SYMBOL(bit_wait_io);
>
> Any ideas why it might happen and why diff above helps?
>
> [1] Bisect log:
>
> I checked 4.[0-3] and 3.1[7-9] manually
> I have https://lkml.org/lkml/2015/8/26/649 for better report
>
>> git bisect start '--no-checkout'
>> # good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
>> git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
>> # bad: [53ee983378ff23e8f3ff95ecf99dea7c6c221900] Merge tag 'staging-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
>> git bisect bad 53ee983378ff23e8f3ff95ecf99dea7c6c221900
>> # good: [2042088cd67d0064d18c52c13c69af2499907bb1] staging: comedi: ni_labpc: tidy up labpc_ai_scan_mode()
>> git bisect good 2042088cd67d0064d18c52c13c69af2499907bb1
>> # bad: [98959948a7ba33cf8c708626e0d2a1456397e1c6] Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> git bisect bad 98959948a7ba33cf8c708626e0d2a1456397e1c6
>> # good: [c9b88e9581828bb8bba06c5e7ee8ed1761172b6e] Merge tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
>> git bisect good c9b88e9581828bb8bba06c5e7ee8ed1761172b6e
>> # good: [5bda4f638f36ef4c4e3b1397b02affc3db94356e] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> git bisect good 5bda4f638f36ef4c4e3b1397b02affc3db94356e
>> # good: [2336ebc32676df5b794acfe0c980583ec6c05f34] Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf into perf/core
>> git bisect good 2336ebc32676df5b794acfe0c980583ec6c05f34
>> # good: [ef35ad26f8ff44d2c93e29952cdb336bda729d9d] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> git bisect good ef35ad26f8ff44d2c93e29952cdb336bda729d9d
>> # good: [1c5d3eb3759013bc7ee4197aa0a9f245bdb6eb90] sched/numa: Simplify task_numa_compare()
>> git bisect good 1c5d3eb3759013bc7ee4197aa0a9f245bdb6eb90
>> # good: [6e76ea8a8209386c3cc7ee5594e6ea5d25525cf2] sched: Remove extra static_key*() function indirection
>> git bisect good 6e76ea8a8209386c3cc7ee5594e6ea5d25525cf2
>> # bad: [c1221321b7c25b53204447cff9949a6d5a7ddddc] sched: Allow wait_on_bit_action() functions to support a timeout
>> git bisect bad c1221321b7c25b53204447cff9949a6d5a7ddddc
>> # good: [e720fff6341fe4b95e5a93c939bd3c77fa55ced4] sched/numa: Revert "Use effective_load() to balance NUMA loads"
>> git bisect good e720fff6341fe4b95e5a93c939bd3c77fa55ced4
>> # bad: [743162013d40ca612b4cb53d3a200dff2d9ab26e] sched: Remove proliferation of wait_on_bit() action functions
>> git bisect bad 743162013d40ca612b4cb53d3a200dff2d9ab26e
>> # good: [d26fad5b38e1c4667d4f2604936e59c837caa54d] Merge tag 'v3.16-rc5' into sched/core, to refresh the branch before applying bigger tree-wide changes
>> git bisect good d26fad5b38e1c4667d4f2604936e59c837caa54d
>
> [2] rcu_sched self-detected stall under Linux 3.17
>
>> INFO: rcu_sched self-detected stall on CPU { 2} (t=2100 jiffies g=-199 c=-200 q=172)
>> Task dump for CPU 2:
>> paranoia R running 0 414 1 0x00000003
>> [<c0015394>] (unwind_backtrace) from [<c00117dc>] (show_stack+0x10/0x14)
>> [<c00117dc>] (show_stack) from [<c00602d0>] (rcu_dump_cpu_stacks+0xa4/0xc4)
>> [<c00602d0>] (rcu_dump_cpu_stacks) from [<c0062aec>] (rcu_check_callbacks+0x280/0x6f8)
>> [<c0062aec>] (rcu_check_callbacks) from [<c0065b9c>] (update_process_times+0x40/0x60)
>> [<c0065b9c>] (update_process_times) from [<c007169c>] (tick_periodic+0xa8/0xb8)
>> [<c007169c>] (tick_periodic) from [<c00717bc>] (tick_handle_periodic+0x28/0x88)
>> [<c00717bc>] (tick_handle_periodic) from [<c01fc414>] (arch_timer_handler_virt+0x28/0x30)
>> [<c01fc414>] (arch_timer_handler_virt) from [<c005d100>] (handle_percpu_devid_irq+0x6c/0x84)
>> [<c005d100>] (handle_percpu_devid_irq) from [<c0059658>] (generic_handle_irq+0x2c/0x3c)
>> [<c0059658>] (generic_handle_irq) from [<c000ee3c>] (handle_IRQ+0x7c/0x8c)
>> [<c000ee3c>] (handle_IRQ) from [<c00085f8>] (gic_handle_irq+0x40/0x5c)
>> [<c00085f8>] (gic_handle_irq) from [<c0012300>] (__irq_svc+0x40/0x54)
>> Exception stack(0xdb063c60 to 0xdb063ca8)
>> 3c60: 00000000 0001b000 dbbfc000 0000000b dbbfc000 0000000b dbbfc000 0001b000
>> 3c80: 0001b000 dbb2d6e0 0000001b db03c380 0000006c db063ca8 c00b935c c00b935c
>> 3ca0: a00d0113 ffffffff
>> [<c0012300>] (__irq_svc) from [<c00b935c>] (do_cow_fault.isra.97+0x1c/0x198)
>> [<c00b935c>] (do_cow_fault.isra.97) from [<c00b9804>] (handle_mm_fault+0x14c/0x900)
>> [<c00b9804>] (handle_mm_fault) from [<c001ae38>] (do_page_fault+0x120/0x3d8)
>> [<c001ae38>] (do_page_fault) from [<c00084c0>] (do_DataAbort+0x38/0x98)
>> [<c00084c0>] (do_DataAbort) from [<c0012298>] (__dabt_svc+0x38/0x60)
>> Exception stack(0xdb063e70 to 0xdb063eb8)
>> 3e60: 0001b2c8 00000d30 00000000 00000000
>> 3e80: db902a80 dbadc100 00000000 db063ec8 db062008 dbadc200 0001b2c8 0001b6cc
>> 3ea0: 00000000 db063eb8 c011689c c015a884 200d0013 ffffffff
>> [<c0012298>] (__dabt_svc) from [<c015a884>] (__clear_user_std+0x34/0x64)
>> [<c015a884>] (__clear_user_std) from [<c011689c>] (padzero+0x44/0x58)
>> [<c011689c>] (padzero) from [<c01182e8>] (load_elf_binary+0x778/0x138c)
>> [<c01182e8>] (load_elf_binary) from [<c00da960>] (search_binary_handler+0x98/0x1d8)
>> [<c00da960>] (search_binary_handler) from [<c00dbc2c>] (do_execve+0x35c/0x4bc)
>> [<c00dbc2c>] (do_execve) from [<c000e520>] (ret_fast_syscall+0x0/0x30)
>
> Thanks
> Vladimir
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Vladimir Murzin <vladimir.murzin@arm.com>
To: linux-kernel@vger.kernel.org
Cc: neilb@suse.de, oleg@redhat.com, peterz@infradead.org,
mark.rutland@arm.com, linux-arm-kernel@lists.infradead.org,
linux-mm@kvack.org
Subject: Re: [BISECTED] rcu_sched self-detected stall since 3.17
Date: Tue, 01 Dec 2015 11:50:17 +0000 [thread overview]
Message-ID: <565D8979.5050605@arm.com> (raw)
In-Reply-To: <564F3DCA.1080907@arm.com>
On 20/11/15 15:35, Vladimir Murzin wrote:
> Hi,
>
> I've been getting rcu_sched self-detected stall on one of our test
> environment with the following (or similar) log:
>
> Linux 4.3
Anybody looking into this? Oleg, Peter, any ideas what's happening here?
Thanks
Vladimir
>
>> INFO: rcu_sched self-detected stall on CPU
>> 2: (2099 ticks this GP) idle=a73/140000000000001/0 softirq=277/283 fqs=2092
>> (t=2100 jiffies g=-197 c=-198 q=230)
>> Task dump for CPU 2:
>> paranoia R running 0 423 1 0x00000003
>> [<c0016f20>] (unwind_backtrace) from [<c0012fd0>] (show_stack+0x10/0x14)
>> [<c0012fd0>] (show_stack) from [<c006c52c>] (rcu_dump_cpu_stacks+0xa4/0xc4)
>> [<c006c52c>] (rcu_dump_cpu_stacks) from [<c006f188>] (rcu_check_callbacks+0x2dc/0x81c)
>> [<c006f188>] (rcu_check_callbacks) from [<c0072278>] (update_process_times+0x38/0x64)
>> [<c0072278>] (update_process_times) from [<c007e7b8>] (tick_periodic+0xa8/0xb8)
>> [<c007e7b8>] (tick_periodic) from [<c007e904>] (tick_handle_periodic+0x28/0x88)
>> [<c007e904>] (tick_handle_periodic) from [<c0223074>] (arch_timer_handler_virt+0x28/0x30)
>> [<c0223074>] (arch_timer_handler_virt) from [<c0068008>] (handle_percpu_devid_irq+0x6c/0x84)
>> [<c0068008>] (handle_percpu_devid_irq) from [<c0064244>] (generic_handle_irq+0x24/0x34)
>> [<c0064244>] (generic_handle_irq) from [<c0064518>] (__handle_domain_irq+0x98/0xac)
>> [<c0064518>] (__handle_domain_irq) from [<c0009444>] (gic_handle_irq+0x54/0x90)
>> [<c0009444>] (gic_handle_irq) from [<c0013b14>] (__irq_svc+0x54/0x70)
>> Exception stack(0xdb007c60 to 0xdb007ca8)
>> 7c60: 00000000 00000000 00000000 00000001 dc3e9a20 00000000 00000000 00000001
>> 7c80: dc3e9a20 0000001b 00080000 0000000b ffffffff db007cb0 c00ac8f0 c00aa96c
>> 7ca0: 600d0113 ffffffff
>> [<c0013b14>] (__irq_svc) from [<c00aa96c>] (free_pages_prepare+0x18c/0x218)
>> [<c00aa96c>] (free_pages_prepare) from [<c00ac8f0>] (free_hot_cold_page+0x34/0x168)
>> [<c00ac8f0>] (free_hot_cold_page) from [<c00c996c>] (handle_mm_fault+0x7b8/0xe60)
>> [<c00c996c>] (handle_mm_fault) from [<c001c194>] (do_page_fault+0x12c/0x378)
>> [<c001c194>] (do_page_fault) from [<c00092dc>] (do_DataAbort+0x38/0xb4)
>> [<c00092dc>] (do_DataAbort) from [<c0013aa0>] (__dabt_svc+0x40/0x60)
>> Exception stack(0xdb007e68 to 0xdb007eb0)
>> 7e60: 0001b2c8 00000d30 00000000 00000055 00000051 dbb04600
>> 7e80: 00000000 db007ec8 db8ea700 db8f3080 dba43240 0001b2c8 00000000 db007eb8
>> 7ea0: c012c178 c0174ce8 200d0113 ffffffff
>> [<c0013aa0>] (__dabt_svc) from [<c0174ce8>] (__clear_user_std+0x34/0x68)
>> [<c0174ce8>] (__clear_user_std) from [<c012c178>] (padzero+0x58/0x74)
>> [<c012c178>] (padzero) from [<c012cb4c>] (load_elf_binary+0x730/0x1250)
>> [<c012cb4c>] (load_elf_binary) from [<c00ecd6c>] (search_binary_handler+0xa0/0x23c)
>> [<c00ecd6c>] (search_binary_handler) from [<c00ed730>] (do_execveat_common+0x444/0x5d8)
>> [<c00ed730>] (do_execveat_common) from [<c00ed8e8>] (do_execve+0x24/0x2c)
>> [<c00ed8e8>] (do_execve) from [<c000f580>] (ret_fast_syscall+0x0/0x3c)
>
> (I put report for Linux 3.17 under [2])
>
> This rcu_sched self-detected stall is usually reproduced 2 or 3 times
> out of 10 runs, but sometimes even more runs go without issue.
>
> I bisected [1] it to commit
>
> commit 743162013d40ca612b4cb53d3a200dff2d9ab26e
> Author: NeilBrown <neilb@suse.de>
> Date: Mon Jul 7 15:16:04 2014 +1000
>
> sched: Remove proliferation of wait_on_bit() action functions
>
> The only change I noticed is from (mm/filemap.c)
>
> io_schedule();
> fatal_signal_pending(current)
>
> to (kernel/sched/wait.c)
>
> signal_pending_state(current->state, current)
> io_schedule();
>
> and if I apply following diff I don't see stalls anymore.
>
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index a104879..2d68cdb 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -514,9 +514,10 @@ EXPORT_SYMBOL(bit_wait);
>
> __sched int bit_wait_io(void *word)
> {
> + io_schedule();
> +
> if (signal_pending_state(current->state, current))
> return 1;
> - io_schedule();
> return 0;
> }
> EXPORT_SYMBOL(bit_wait_io);
>
> Any ideas why it might happen and why diff above helps?
>
> [1] Bisect log:
>
> I checked 4.[0-3] and 3.1[7-9] manually
> I have https://lkml.org/lkml/2015/8/26/649 for better report
>
>> git bisect start '--no-checkout'
>> # good: [19583ca584d6f574384e17fe7613dfaeadcdc4a6] Linux 3.16
>> git bisect good 19583ca584d6f574384e17fe7613dfaeadcdc4a6
>> # bad: [53ee983378ff23e8f3ff95ecf99dea7c6c221900] Merge tag 'staging-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
>> git bisect bad 53ee983378ff23e8f3ff95ecf99dea7c6c221900
>> # good: [2042088cd67d0064d18c52c13c69af2499907bb1] staging: comedi: ni_labpc: tidy up labpc_ai_scan_mode()
>> git bisect good 2042088cd67d0064d18c52c13c69af2499907bb1
>> # bad: [98959948a7ba33cf8c708626e0d2a1456397e1c6] Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> git bisect bad 98959948a7ba33cf8c708626e0d2a1456397e1c6
>> # good: [c9b88e9581828bb8bba06c5e7ee8ed1761172b6e] Merge tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
>> git bisect good c9b88e9581828bb8bba06c5e7ee8ed1761172b6e
>> # good: [5bda4f638f36ef4c4e3b1397b02affc3db94356e] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> git bisect good 5bda4f638f36ef4c4e3b1397b02affc3db94356e
>> # good: [2336ebc32676df5b794acfe0c980583ec6c05f34] Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf into perf/core
>> git bisect good 2336ebc32676df5b794acfe0c980583ec6c05f34
>> # good: [ef35ad26f8ff44d2c93e29952cdb336bda729d9d] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>> git bisect good ef35ad26f8ff44d2c93e29952cdb336bda729d9d
>> # good: [1c5d3eb3759013bc7ee4197aa0a9f245bdb6eb90] sched/numa: Simplify task_numa_compare()
>> git bisect good 1c5d3eb3759013bc7ee4197aa0a9f245bdb6eb90
>> # good: [6e76ea8a8209386c3cc7ee5594e6ea5d25525cf2] sched: Remove extra static_key*() function indirection
>> git bisect good 6e76ea8a8209386c3cc7ee5594e6ea5d25525cf2
>> # bad: [c1221321b7c25b53204447cff9949a6d5a7ddddc] sched: Allow wait_on_bit_action() functions to support a timeout
>> git bisect bad c1221321b7c25b53204447cff9949a6d5a7ddddc
>> # good: [e720fff6341fe4b95e5a93c939bd3c77fa55ced4] sched/numa: Revert "Use effective_load() to balance NUMA loads"
>> git bisect good e720fff6341fe4b95e5a93c939bd3c77fa55ced4
>> # bad: [743162013d40ca612b4cb53d3a200dff2d9ab26e] sched: Remove proliferation of wait_on_bit() action functions
>> git bisect bad 743162013d40ca612b4cb53d3a200dff2d9ab26e
>> # good: [d26fad5b38e1c4667d4f2604936e59c837caa54d] Merge tag 'v3.16-rc5' into sched/core, to refresh the branch before applying bigger tree-wide changes
>> git bisect good d26fad5b38e1c4667d4f2604936e59c837caa54d
>
> [2] rcu_sched self-detected stall under Linux 3.17
>
>> INFO: rcu_sched self-detected stall on CPU { 2} (t=2100 jiffies g=-199 c=-200 q=172)
>> Task dump for CPU 2:
>> paranoia R running 0 414 1 0x00000003
>> [<c0015394>] (unwind_backtrace) from [<c00117dc>] (show_stack+0x10/0x14)
>> [<c00117dc>] (show_stack) from [<c00602d0>] (rcu_dump_cpu_stacks+0xa4/0xc4)
>> [<c00602d0>] (rcu_dump_cpu_stacks) from [<c0062aec>] (rcu_check_callbacks+0x280/0x6f8)
>> [<c0062aec>] (rcu_check_callbacks) from [<c0065b9c>] (update_process_times+0x40/0x60)
>> [<c0065b9c>] (update_process_times) from [<c007169c>] (tick_periodic+0xa8/0xb8)
>> [<c007169c>] (tick_periodic) from [<c00717bc>] (tick_handle_periodic+0x28/0x88)
>> [<c00717bc>] (tick_handle_periodic) from [<c01fc414>] (arch_timer_handler_virt+0x28/0x30)
>> [<c01fc414>] (arch_timer_handler_virt) from [<c005d100>] (handle_percpu_devid_irq+0x6c/0x84)
>> [<c005d100>] (handle_percpu_devid_irq) from [<c0059658>] (generic_handle_irq+0x2c/0x3c)
>> [<c0059658>] (generic_handle_irq) from [<c000ee3c>] (handle_IRQ+0x7c/0x8c)
>> [<c000ee3c>] (handle_IRQ) from [<c00085f8>] (gic_handle_irq+0x40/0x5c)
>> [<c00085f8>] (gic_handle_irq) from [<c0012300>] (__irq_svc+0x40/0x54)
>> Exception stack(0xdb063c60 to 0xdb063ca8)
>> 3c60: 00000000 0001b000 dbbfc000 0000000b dbbfc000 0000000b dbbfc000 0001b000
>> 3c80: 0001b000 dbb2d6e0 0000001b db03c380 0000006c db063ca8 c00b935c c00b935c
>> 3ca0: a00d0113 ffffffff
>> [<c0012300>] (__irq_svc) from [<c00b935c>] (do_cow_fault.isra.97+0x1c/0x198)
>> [<c00b935c>] (do_cow_fault.isra.97) from [<c00b9804>] (handle_mm_fault+0x14c/0x900)
>> [<c00b9804>] (handle_mm_fault) from [<c001ae38>] (do_page_fault+0x120/0x3d8)
>> [<c001ae38>] (do_page_fault) from [<c00084c0>] (do_DataAbort+0x38/0x98)
>> [<c00084c0>] (do_DataAbort) from [<c0012298>] (__dabt_svc+0x38/0x60)
>> Exception stack(0xdb063e70 to 0xdb063eb8)
>> 3e60: 0001b2c8 00000d30 00000000 00000000
>> 3e80: db902a80 dbadc100 00000000 db063ec8 db062008 dbadc200 0001b2c8 0001b6cc
>> 3ea0: 00000000 db063eb8 c011689c c015a884 200d0013 ffffffff
>> [<c0012298>] (__dabt_svc) from [<c015a884>] (__clear_user_std+0x34/0x64)
>> [<c015a884>] (__clear_user_std) from [<c011689c>] (padzero+0x44/0x58)
>> [<c011689c>] (padzero) from [<c01182e8>] (load_elf_binary+0x778/0x138c)
>> [<c01182e8>] (load_elf_binary) from [<c00da960>] (search_binary_handler+0x98/0x1d8)
>> [<c00da960>] (search_binary_handler) from [<c00dbc2c>] (do_execve+0x35c/0x4bc)
>> [<c00dbc2c>] (do_execve) from [<c000e520>] (ret_fast_syscall+0x0/0x30)
>
> Thanks
> Vladimir
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
>
>
next prev parent reply other threads:[~2015-12-01 11:50 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-11-20 15:35 [BISECTED] rcu_sched self-detected stall since 3.17 Vladimir Murzin
2015-11-20 15:35 ` Vladimir Murzin
2015-11-20 15:35 ` Vladimir Murzin
2015-12-01 11:50 ` Vladimir Murzin [this message]
2015-12-01 11:50 ` Vladimir Murzin
2015-12-01 11:50 ` Vladimir Murzin
2015-12-01 13:04 ` Peter Zijlstra
2015-12-01 13:04 ` Peter Zijlstra
2015-12-01 13:04 ` Peter Zijlstra
2015-12-02 9:04 ` Vladimir Murzin
2015-12-02 9:04 ` Vladimir Murzin
2015-12-02 9:04 ` Vladimir Murzin
2015-12-04 11:52 ` [tip:locking/core] sched/wait: Fix signal handling in bit wait helpers tip-bot for Peter Zijlstra
2015-12-08 10:47 ` Peter Zijlstra
2015-12-09 1:06 ` NeilBrown
2015-12-09 7:40 ` Peter Zijlstra
2015-12-09 21:30 ` NeilBrown
2015-12-10 13:09 ` Peter Zijlstra
2015-12-11 11:30 ` Paul Turner
2015-12-11 11:39 ` Peter Zijlstra
2015-12-11 11:53 ` Vladimir Murzin
2015-12-11 13:08 ` Jan Stancek
2015-12-11 13:22 ` Peter Zijlstra
2015-12-11 17:57 ` Vladimir Murzin
2015-12-15 18:16 ` Oleg Nesterov
2015-12-15 19:01 ` Oleg Nesterov
2015-12-15 16:56 ` [BISECTED] rcu_sched self-detected stall since 3.17 Oleg Nesterov
2015-12-15 16:56 ` Oleg Nesterov
2015-12-15 16:56 ` Oleg Nesterov
-- strict thread matches above, loose matches on Subject: below --
2015-12-02 6:53 Ross Green
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=565D8979.5050605@arm.com \
--to=vladimir.murzin@arm.com \
--cc=linux-arm-kernel@lists.infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.