From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Philipp Hahn <hahn@univention.de>,
Ankur Arora <ankur.a.arora@oracle.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: 4.9.52: INFO: task XXX blocked for more than 300 seconds.
Date: Thu, 5 Oct 2017 09:20:16 -0400 [thread overview]
Message-ID: <20171005132016.GG24547@char.us.oracle.com> (raw)
In-Reply-To: <ea235063-e47b-6475-ebf0-262883200225@univention.de>
On Wed, Oct 04, 2017 at 08:26:27PM +0200, Philipp Hahn wrote:
> Hello,
>
Adding Ankur to this as I think he saw something similar.
But in the meantime - do you see this with the latest version of Linux?
> with linux-4.9.52 running on Debian-Wheezy with Xen-4.1 I observed
> several stuck processes: Here is one (truncated) dump of the Linux
> kernel messages:
>
> > [<ffffffff8160db6d>] ? __schedule+0x23d/0x6d0
> > [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
> > [<ffffffff8160e032>] ? schedule+0x32/0x80
> > [<ffffffff8161150c>] ? schedule_timeout+0x1ec/0x360
> > [<ffffffff8130f277>] ? __blk_mq_run_hw_queue+0x327/0x3e0* see below
> > [<ffffffff8101b7f1>] ? xen_clocksource_get_cycles+0x11/0x20
> > [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
> > [<ffffffff8160d8b4>] ? io_schedule_timeout+0xb4/0x130
> > [<ffffffff810bb6f7>] ? prepare_to_wait+0x57/0x80
> > [<ffffffff8160e8b7>] ? bit_wait_io+0x17/0x60
> > [<ffffffff8160e3ac>] ? __wait_on_bit+0x5c/0x90
> > [<ffffffff8160e8a0>] ? bit_wait_timeout+0x90/0x90
> > [<ffffffff8160e50e>] ? out_of_line_wait_on_bit+0x7e/0xa0
> > [<ffffffff810bba20>] ? autoremove_wake_function+0x40/0x40
> > [<ffffffffc00abd18>] ? jbd2_journal_commit_transaction+0xd48/0x17e0 [jbd2]
> > [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
> > [<ffffffff810e67fd>] ? try_to_del_timer_sync+0x4d/0x80
> > [<ffffffffc00b099d>] ? kjournald2+0xdd/0x280 [jbd2]
> > [<ffffffff810bb9e0>] ? wake_up_atomic_t+0x30/0x30
> > [<ffffffffc00b08c0>] ? commit_timeout+0x10/0x10 [jbd2]
> > [<ffffffff81097660>] ? kthread+0xf0/0x110
> > [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
> > [<ffffffff81097570>] ? kthread_park+0x60/0x60
> > [<ffffffff81612bb5>] ? ret_from_fork+0x25/0x30
> > NMI backtrace for cpu 2
> > CPU: 2 PID: 35 Comm: khungtaskd Not tainted 4.9.0-ucs105-amd64 #1 Debian 4.9.30-2A~4.2.0.201709271649
> > 0000000000000000 ffffffff81331935 0000000000000000 0000000000000002
> > ffffffff81335e60 0000000000000002 ffffffff8104cb70 ffff8801f0c90e80
> > ffffffff81335f6a ffff8801f0c90e80 00000000003fffbc ffffffff81128048
> > Call Trace:
> > [<ffffffff81331935>] ? dump_stack+0x5c/0x77
> > [<ffffffff81335e60>] ? nmi_cpu_backtrace+0x90/0xa0
> > [<ffffffff8104cb70>] ? irq_force_complete_move+0x140/0x140
> > [<ffffffff81335f6a>] ? nmi_trigger_cpumask_backtrace+0xfa/0x130
> > [<ffffffff81128048>] ? watchdog+0x2b8/0x330
> > [<ffffffff81127d90>] ? reset_hung_task_detector+0x10/0x10
> > [<ffffffff81097660>] ? kthread+0xf0/0x110
> > [<ffffffff810247d9>] ? __switch_to+0x2c9/0x720
> > [<ffffffff81097570>] ? kthread_park+0x60/0x60
> > [<ffffffff81612bb5>] ? ret_from_fork+0x25/0x30
> > Sending NMI from CPU 2 to CPUs 0-1,3:
> > NMI backtrace for cpu 1
> > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.9.0-ucs105-amd64 #1 Debian 4.9.30-2A~4.2.0.201709271649
> > task: ffff8801f4a02ec0 task.stack: ffffc90040ca4000
> > RIP: e030:[<ffffffff810013aa>] [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
> > RSP: e02b:ffffc90040ca7ed0 EFLAGS: 00000246
> > RAX: 0000000000000000 RBX: ffff8801f4a02ec0 RCX: ffffffff810013aa
> > RDX: ffffffff81c4ba70 RSI: 0000000000000000 RDI: 0000000000000001
> > RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
> > R13: 0000000000000000 R14: ffff8801f4a02ec0 R15: ffff8801f4a02ec0
> > FS: 00007f23ac595700(0000) GS:ffff8801f5a80000(0000) knlGS:0000000000000000
> > CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
> > CR2: 00007f52537d6d46 CR3: 00000001bba23000 CR4: 0000000000002660
> > Stack:
> > ffff8801bb832201 0000000000000001 ffffffff8101b55c ffffffff81611ec8
> > ffff8801f4a02ec0 0000000000000001 ffffffff810bc280 ffff8801f4a02ec0
> > ffff8801f4a02ec0 c0a995961d41240f addce6dcadd009c9 0000000000000000
> > Call Trace:
> > [<ffffffff8101b55c>] ? xen_safe_halt+0xc/0x20
> > [<ffffffff81611ec8>] ? default_idle+0x18/0xd0
> > [<ffffffff810bc280>] ? cpu_startup_entry+0x1f0/0x260
> > Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
> > NMI backtrace for cpu 3
> > CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.9.0-ucs105-amd64 #1 Debian 4.9.30-2A~4.2.0.201709271649
> > task: ffff8801f4a24f00 task.stack: ffffc90040cb4000
> > RIP: e030:[<ffffffff810013aa>] [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
> > RSP: e02b:ffffc90040cb7ed0 EFLAGS: 00000246
> > RAX: 0000000000000000 RBX: ffff8801f4a24f00 RCX: ffffffff810013aa
> > RDX: ffffffff81c4ba70 RSI: 0000000000000000 RDI: 0000000000000001
> > RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
> > R13: 0000000000000000 R14: ffff8801f4a24f00 R15: ffff8801f4a24f00
> > FS: 00007f1a2af19700(0000) GS:ffff8801f5b80000(0000) knlGS:0000000000000000
> > CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
> > CR2: 00007f4a5416b000 CR3: 00000001d83ec000 CR4: 0000000000002660
> > Stack:
> > 0000000000000001 0000000000000001 ffffffff8101b55c ffffffff81611ec8
> > ffff8801f4a24f00 0000000000000003 ffffffff810bc280 ffff8801f4a24f00
> > ffff8801f4a24f00 77816deb133b9979 addce6dcadd009c9 0000000000000000
> > Call Trace:
> > [<ffffffff8101b55c>] ? xen_safe_halt+0xc/0x20
> > [<ffffffff81611ec8>] ? default_idle+0x18/0xd0
> > [<ffffffff810bc280>] ? cpu_startup_entry+0x1f0/0x260
> > Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
> > NMI backtrace for cpu 0
> > CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-ucs105-amd64 #1 Debian 4.9.30-2A~4.2.0.201709271649
> > task: ffffffff81c0e540 task.stack: ffffffff81c00000
> > RIP: e030:[<ffffffff810013aa>] [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
> > RSP: e02b:ffffffff81c03e90 EFLAGS: 00000246
> > RAX: 0000000000000000 RBX: ffffffff81c0e540 RCX: ffffffff810013aa
> > RDX: ffffffff81c4ba70 RSI: 0000000000000000 RDI: 0000000000000001
> > RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
> > R13: 0000000000000000 R14: ffffffff81c0e540 R15: ffffffff81c0e540
> > FS: 00007f977d3c2700(0000) GS:ffff8801f5a00000(0000) knlGS:0000000000000000
> > CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 000055ae9becabc8 CR3: 00000001d6a40000 CR4: 0000000000002660
> > Stack:
> > 0000000000000001 0000000000000001 ffffffff8101b55c ffffffff81611ec8
> > ffffffff81c0e540 0000000000000000 ffffffff810bc280 ffffffff81c0e540
> > ffffffff81c0e540 ebeac214e330bd3f addce6dcadd009c9 ffffffffffffffff
> > Call Trace:
> > [<ffffffff8101b55c>] ? xen_safe_halt+0xc/0x20
> > [<ffffffff81611ec8>] ? default_idle+0x18/0xd0
> > [<ffffffff810bc280>] ? cpu_startup_entry+0x1f0/0x260
> > [<ffffffff81d4bf84>] ? start_kernel+0x46d/0x48d
> > [<ffffffff81d510e6>] ? xen_start_kernel+0x52e/0x538
> > Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
>
>
> Looking at the dis-assembly of xen_clocksource_get_cycles() in
> arch/x86/xen/time.c I see no path how that should call
> __blk_mq_run_hw_queue():
>
> > 00000000000001a0 <xen_clocksource_get_cycles> mov %gs:0x0(%rip),%rdi # 00000000000001a8 <xen_clocksource_get_cycles+0x8>
> > 00000000000001a8 <xen_clocksource_get_cycles+0x8> add $0x20,%rdi
> > ret = pvclock_clocksource_read(src);
> > 00000000000001ac <xen_clocksource_get_cycles+0xc> callq 00000000000001b1 <xen_clocksource_get_cycles+0x11>
> > }
> >
> > static cycle_t xen_clocksource_get_cycles(struct clocksource *cs)
> > {
> > return xen_clocksource_read();
> > }
> > 00000000000001b1 <xen_clocksource_get_cycles+0x11> retq
> > 00000000000001b2 <xen_clocksource_get_cycles+0x12> data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
> >
> > static void xen_read_wallclock(struct timespec *ts)
> > {
> > struct shared_info *s = HYPERVISOR_shared_info;
> > struct pvclock_wall_clock *wall_clock = &(s->wc);
> > 00000000000001c0 <xen_get_wallclock> mov 0x0(%rip),%rax # 00000000000001c7 <xen_get_wallclock+0x7>
>
>
> Here's another dump, which diverges from xen_clocksource_get_cycles() to
> some completely other function:
>
> > INFO: task btrfs-transacti:522 blocked for more than 300 seconds.
> > btrfs-transacti D 0 522 2 0x00000000
> > ffff8801f3836000 0000000000000000 ffff8801f4a24f00 ffff8801f32da1c0
> > ffff8801f5b18780 ffffc9004199fa40 ffffffff8160cd2d 0000000000000000
> > ffff8801741a58a8 0000000000000102 000000000000000e ffff8801f32da1c0
> > Call Trace:
> > [<ffffffff8160cd2d>] ? __schedule+0x23d/0x6d0
> > [<ffffffff8160da60>] ? bit_wait_timeout+0x90/0x90
> > [<ffffffff8160d1f2>] ? schedule+0x32/0x80
> > [<ffffffff81610729>] ? schedule_timeout+0x249/0x300
> ***
> > [<ffffffff8101b7d1>] ? xen_clocksource_get_cycles+0x11/0x20
> > [<ffffffff8160da60>] ? bit_wait_timeout+0x90/0x90
> > [<ffffffff8160ca74>] ? io_schedule_timeout+0xb4/0x130
> > [<ffffffff810bb5e7>] ? prepare_to_wait+0x57/0x80
> > [<ffffffff8160da77>] ? bit_wait_io+0x17/0x60
> > [<ffffffff8160d56c>] ? __wait_on_bit+0x5c/0x90
> > [<ffffffff8117e18e>] ? find_get_pages_tag+0x15e/0x300
> > [<ffffffff8117d016>] ? wait_on_page_bit+0x86/0xa0
> > [<ffffffff810bb910>] ? autoremove_wake_function+0x40/0x40
> > [<ffffffff8117d107>] ? __filemap_fdatawait_range+0xd7/0x150
> > [<ffffffff8117d18f>] ? filemap_fdatawait_range+0xf/0x30
> > [<ffffffffc0175643>] ? btrfs_wait_ordered_range+0x73/0x110 [btrfs]
> > [<ffffffffc01a2a0d>] ? btrfs_wait_cache_io+0x5d/0x1f0 [btrfs]
> > [<ffffffffc0141a36>] ? btrfs_write_dirty_block_groups+0x106/0x380 [btrfs]
> > [<ffffffffc0140c5d>] ? btrfs_run_delayed_refs+0x1fd/0x2b0 [btrfs]
> > [<ffffffffc01551d7>] ? commit_cowonly_roots+0x257/0x2f0 [btrfs]
> > [<ffffffffc0157a24>] ? btrfs_commit_transaction+0x4e4/0xa40 [btrfs]
> > [<ffffffffc015801d>] ? start_transaction+0x9d/0x4a0 [btrfs]
> > [<ffffffffc01523a2>] ? transaction_kthread+0x1b2/0x1f0 [btrfs]
> > [<ffffffffc01521f0>] ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
> > [<ffffffff810975c0>] ? kthread+0xf0/0x110
> > [<ffffffff8102476b>] ? __switch_to+0x2bb/0x700
> > [<ffffffff810974d0>] ? kthread_park+0x60/0x60
> > [<ffffffff81611d35>] ? ret_from_fork+0x25/0x30
>
> And another one:
> > INFO: task smbd:20101 blocked for more than 300 seconds.
> > smbd D 0 20101 1714 0x00000000
> > ffff8801f01cac00 0000000000000000 ffff8801f4a241c0 ffff88007f9a5240
> > ffff8801f5b98780 ffffc90049a33bc0 ffffffff8160cd2d ffff8800f78bb5b0
> > fd2236313fb5274b 000000008dc3c1a0 ffff8801f3ef0a40 ffff88007f9a5240
> > Call Trace:
> > [<ffffffff8160cd2d>] ? __schedule+0x23d/0x6d0
> > [<ffffffff8160da60>] ? bit_wait_timeout+0x90/0x90
> > [<ffffffff8160d1f2>] ? schedule+0x32/0x80
> > [<ffffffff81610729>] ? schedule_timeout+0x249/0x300
> > [<ffffffff81336486>] ? __radix_tree_lookup+0x76/0xe0
> ***
> > [<ffffffff8101b7d1>] ? xen_clocksource_get_cycles+0x11/0x20
> > [<ffffffff8160da60>] ? bit_wait_timeout+0x90/0x90
> > [<ffffffff8160ca74>] ? io_schedule_timeout+0xb4/0x130
> > [<ffffffff810bb5e7>] ? prepare_to_wait+0x57/0x80
> > [<ffffffff8160da77>] ? bit_wait_io+0x17/0x60
> > [<ffffffff8160d56c>] ? __wait_on_bit+0x5c/0x90
> > [<ffffffffc00f290b>] ? __jbd2_journal_file_buffer+0xcb/0x180 [jbd2]
> > [<ffffffff8160da60>] ? bit_wait_timeout+0x90/0x90
> > [<ffffffff8160d6ce>] ? out_of_line_wait_on_bit+0x7e/0xa0
> > [<ffffffff810bb910>] ? autoremove_wake_function+0x40/0x40
> > [<ffffffffc00f2bc8>] ? do_get_write_access+0x208/0x420 [jbd2]
> > [<ffffffffc00f2e0e>] ? jbd2_journal_get_write_access+0x2e/0x60 [jbd2]
> > [<ffffffffc02c78a6>] ? __ext4_journal_get_write_access+0x36/0x70 [ext4]
> > [<ffffffffc02a3913>] ? ext4_orphan_add+0xd3/0x230 [ext4]
> > [<ffffffffc0296b4a>] ? ext4_mark_inode_dirty+0x6a/0x200 [ext4]
> > [<ffffffffc02a4b8a>] ? ext4_unlink+0x36a/0x380 [ext4]
> > [<ffffffff81211677>] ? vfs_unlink+0xe7/0x180
> > [<ffffffff81214439>] ? do_unlinkat+0x289/0x300
> > [<ffffffff81611abb>] ? system_call_fast_compare_end+0xc/0x9b
>
> This does not look normal to me or did I miss something?
>
> Where can I get more information on why there is no progress for 300s,
> what should I do to debug which task is waiting for what?
>
> The traces of the of other CPUs look normal to me: the one posted first
> above is the shortest, in all other cases they were sooner or later
> waiting for IO (my interpretation, but I can post them if necessary.)
>
> This problem occurs since the upgrade of the Linux kernel inside the VM
> from 4.1.x to 4.9.32 and now 4.9.52.
>
> Any help is appreciated.
> Philipp Hahn
> --
> Philipp Hahn
> Open Source Software Engineer
>
> Univention GmbH
> be open.
> Mary-Somerville-Str. 1
> D-28359 Bremen
> Tel.: +49 421 22232-0
> Fax : +49 421 22232-99
> hahn@univention.de
>
> http://www.univention.de/
> Geschäftsführer: Peter H. Ganten
> HRB 20755 Amtsgericht Bremen
> Steuer-Nr.: 71-597-02876
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
next prev parent reply other threads:[~2017-10-05 13:20 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-10-04 18:26 4.9.52: INFO: task XXX blocked for more than 300 seconds Philipp Hahn
2017-10-05 10:12 ` Jan Beulich
2017-10-05 12:04 ` Philipp Hahn
2017-10-05 10:38 ` Roger Pau Monné
2017-10-05 13:20 ` Konrad Rzeszutek Wilk [this message]
2017-10-05 17:59 ` Ankur Arora
2017-10-19 14:34 ` [Xen-devel] 4.9.52: INFO: task X " Philipp Hahn
2017-10-19 14:34 ` Philipp Hahn
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20171005132016.GG24547@char.us.oracle.com \
--to=konrad.wilk@oracle.com \
--cc=ankur.a.arora@oracle.com \
--cc=hahn@univention.de \
--cc=linux-kernel@vger.kernel.org \
--cc=xen-devel@lists.xenproject.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.