4.4: INFO: rcu_sched self-detected stall on CPU

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Steven Haigh <netwiz@crc.id.au>
To: xen-devel <xen-devel@lists.xenproject.org>, linux-kernel@vger.kernel.org
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Subject: 4.4: INFO: rcu_sched self-detected stall on CPU
Date: Fri, 25 Mar 2016 13:53:10 +1100	[thread overview]
Message-ID: <56F4A816.3050505@crc.id.au> (raw)


[-- Attachment #1.1: Type: text/plain, Size: 4941 bytes --]

Hi all,

Firstly, I've cross-posted this to xen-devel and the lkml - as this
problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
I have also CC'ed Greg KH for his awesome insight as maintainer.

Please CC myself into replies - as I'm not a member of the kernel
mailing list - I may miss replies from monitoring the archives.

I've noticed recently that heavy disk IO is causing rcu_sched to detect
stalls. The process mentioned usually goes to 100% CPU usage, and
eventually processes start segfaulting and dying. The only fix to
recover the system is to use 'xl destroy' to force-kill the VM and to
start it again.

The majority of these issues seem to mention ext4 in the trace. This may
indicate an issue there - or may be a red herring.

The gritty details:
INFO: rcu_sched self-detected stall on CPU
#0110-...: (20999 ticks this GP) idle=327/140000000000001/0
softirq=1101493/1101493 fqs=6973
#011 (t=21000 jiffies g=827095 c=827094 q=524)
Task dump for CPU 0:
rsync           R  running task        0  2446   2444 0x00000088
ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000
ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001
ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00
Call Trace:
<IRQ>  [<ffffffff810a625f>] sched_show_task+0xaf/0x110
[<ffffffff810a8699>] dump_cpu_task+0x39/0x40
[<ffffffff810d0e5a>] rcu_dump_cpu_stacks+0x8a/0xc0
[<ffffffff810d4884>] rcu_check_callbacks+0x424/0x7a0
[<ffffffff810a91e1>] ? account_system_time+0x81/0x110
[<ffffffff810a9481>] ? account_process_tick+0x61/0x160
[<ffffffff810e8050>] ? tick_sched_do_timer+0x30/0x30
[<ffffffff810d9749>] update_process_times+0x39/0x60
[<ffffffff810e7aa6>] tick_sched_handle.isra.15+0x36/0x50
[<ffffffff810e808d>] tick_sched_timer+0x3d/0x70
[<ffffffff810da342>] __hrtimer_run_queues+0xf2/0x250
[<ffffffff810da698>] hrtimer_interrupt+0xa8/0x190
[<ffffffff8100c61e>] xen_timer_interrupt+0x2e/0x140
[<ffffffff810c8555>] handle_irq_event_percpu+0x55/0x1e0
[<ffffffff810cbbca>] handle_percpu_irq+0x3a/0x50
[<ffffffff810c7d22>] generic_handle_irq+0x22/0x30
[<ffffffff813e50ff>] __evtchn_fifo_handle_events+0x15f/0x180
[<ffffffff813e5130>] evtchn_fifo_handle_events+0x10/0x20
[<ffffffff813e2233>] __xen_evtchn_do_upcall+0x43/0x80
[<ffffffff813e3ea0>] xen_evtchn_do_upcall+0x30/0x50
[<ffffffff8165deb2>] xen_hvm_callback_vector+0x82/0x90
<EOI>  [<ffffffff810baf0d>] ? queued_write_lock_slowpath+0x3d/0x80
[<ffffffff8165bcce>] _raw_write_lock+0x1e/0x30
[<ffffffff81289b23>] ext4_es_remove_extent+0x43/0xc0
[<ffffffff812692f9>] ext4_clear_inode+0x39/0x80
[<ffffffff8124cafd>] ext4_evict_inode+0x8d/0x4e0
[<ffffffff811df537>] evict+0xb7/0x180
[<ffffffff811df636>] dispose_list+0x36/0x50
[<ffffffff811e054b>] prune_icache_sb+0x4b/0x60
[<ffffffff811c92c1>] super_cache_scan+0x141/0x190
[<ffffffff811655fe>] shrink_slab.part.37+0x1ee/0x390
[<ffffffff8116950c>] shrink_zone+0x26c/0x280
[<ffffffff8116988c>] do_try_to_free_pages+0x15c/0x410
[<ffffffff81169bfa>] try_to_free_pages+0xba/0x170
[<ffffffff8115d1e5>] __alloc_pages_nodemask+0x525/0xa60
[<ffffffff811a839c>] ? kmem_cache_free+0xcc/0x2c0
[<ffffffff811a120d>] alloc_pages_current+0x8d/0x120
[<ffffffff81154601>] __page_cache_alloc+0x91/0xc0
[<ffffffff81154b66>] pagecache_get_page+0x56/0x1e0
[<ffffffff81154ee6>] grab_cache_page_write_begin+0x26/0x40
[<ffffffff8124c811>] ext4_da_write_begin+0xa1/0x300
[<ffffffff8124d3a4>] ? ext4_da_write_end+0x124/0x2b0
[<ffffffff811543a0>] generic_perform_write+0xc0/0x1a0
[<ffffffff81156408>] __generic_file_write_iter+0x188/0x1e0
[<ffffffff812421d0>] ext4_file_write_iter+0xf0/0x340
[<ffffffff811c59fa>] __vfs_write+0xaa/0xe0
[<ffffffff811c6022>] vfs_write+0xa2/0x1a0
[<ffffffff811c6d26>] SyS_write+0x46/0xa0
[<ffffffff8165c02e>] entry_SYSCALL_64_fastpath+0x12/0x71

Some 11 hours later:
sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 error
4 in ld-2.17.so[7f03bb940000+21000]
sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 error 4
in ld-2.17.so[7f6b4a0d5000+21000]
systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp
00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 error
4 in ld-2.17.so[7ff8c8a82000+21000]
sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 error
4 in ld-2.17.so[7f183cf6b000+21000]
sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 error
4 in ld-2.17.so[7f3c665e2000+21000]

From isolated testing, this does not occur on kernel 4.5.x - however I
have not verified this myself.

The kernel config used can be found in the kernel-xen git repo if it
assists in debugging:
	http://xen.crc.id.au/git/?p=kernel-xen;a=summary

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

WARNING: multiple messages have this Message-ID (diff)

From: Steven Haigh <netwiz@crc.id.au>
To: xen-devel <xen-devel@lists.xenproject.org>, linux-kernel@vger.kernel.org
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Subject: 4.4: INFO: rcu_sched self-detected stall on CPU
Date: Fri, 25 Mar 2016 13:53:10 +1100	[thread overview]
Message-ID: <56F4A816.3050505@crc.id.au> (raw)


[-- Attachment #1.1.1: Type: text/plain, Size: 4941 bytes --]

Hi all,

Firstly, I've cross-posted this to xen-devel and the lkml - as this
problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
I have also CC'ed Greg KH for his awesome insight as maintainer.

Please CC myself into replies - as I'm not a member of the kernel
mailing list - I may miss replies from monitoring the archives.

I've noticed recently that heavy disk IO is causing rcu_sched to detect
stalls. The process mentioned usually goes to 100% CPU usage, and
eventually processes start segfaulting and dying. The only fix to
recover the system is to use 'xl destroy' to force-kill the VM and to
start it again.

The majority of these issues seem to mention ext4 in the trace. This may
indicate an issue there - or may be a red herring.

The gritty details:
INFO: rcu_sched self-detected stall on CPU
#0110-...: (20999 ticks this GP) idle=327/140000000000001/0
softirq=1101493/1101493 fqs=6973
#011 (t=21000 jiffies g=827095 c=827094 q=524)
Task dump for CPU 0:
rsync           R  running task        0  2446   2444 0x00000088
ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000
ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001
ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00
Call Trace:
<IRQ>  [<ffffffff810a625f>] sched_show_task+0xaf/0x110
[<ffffffff810a8699>] dump_cpu_task+0x39/0x40
[<ffffffff810d0e5a>] rcu_dump_cpu_stacks+0x8a/0xc0
[<ffffffff810d4884>] rcu_check_callbacks+0x424/0x7a0
[<ffffffff810a91e1>] ? account_system_time+0x81/0x110
[<ffffffff810a9481>] ? account_process_tick+0x61/0x160
[<ffffffff810e8050>] ? tick_sched_do_timer+0x30/0x30
[<ffffffff810d9749>] update_process_times+0x39/0x60
[<ffffffff810e7aa6>] tick_sched_handle.isra.15+0x36/0x50
[<ffffffff810e808d>] tick_sched_timer+0x3d/0x70
[<ffffffff810da342>] __hrtimer_run_queues+0xf2/0x250
[<ffffffff810da698>] hrtimer_interrupt+0xa8/0x190
[<ffffffff8100c61e>] xen_timer_interrupt+0x2e/0x140
[<ffffffff810c8555>] handle_irq_event_percpu+0x55/0x1e0
[<ffffffff810cbbca>] handle_percpu_irq+0x3a/0x50
[<ffffffff810c7d22>] generic_handle_irq+0x22/0x30
[<ffffffff813e50ff>] __evtchn_fifo_handle_events+0x15f/0x180
[<ffffffff813e5130>] evtchn_fifo_handle_events+0x10/0x20
[<ffffffff813e2233>] __xen_evtchn_do_upcall+0x43/0x80
[<ffffffff813e3ea0>] xen_evtchn_do_upcall+0x30/0x50
[<ffffffff8165deb2>] xen_hvm_callback_vector+0x82/0x90
<EOI>  [<ffffffff810baf0d>] ? queued_write_lock_slowpath+0x3d/0x80
[<ffffffff8165bcce>] _raw_write_lock+0x1e/0x30
[<ffffffff81289b23>] ext4_es_remove_extent+0x43/0xc0
[<ffffffff812692f9>] ext4_clear_inode+0x39/0x80
[<ffffffff8124cafd>] ext4_evict_inode+0x8d/0x4e0
[<ffffffff811df537>] evict+0xb7/0x180
[<ffffffff811df636>] dispose_list+0x36/0x50
[<ffffffff811e054b>] prune_icache_sb+0x4b/0x60
[<ffffffff811c92c1>] super_cache_scan+0x141/0x190
[<ffffffff811655fe>] shrink_slab.part.37+0x1ee/0x390
[<ffffffff8116950c>] shrink_zone+0x26c/0x280
[<ffffffff8116988c>] do_try_to_free_pages+0x15c/0x410
[<ffffffff81169bfa>] try_to_free_pages+0xba/0x170
[<ffffffff8115d1e5>] __alloc_pages_nodemask+0x525/0xa60
[<ffffffff811a839c>] ? kmem_cache_free+0xcc/0x2c0
[<ffffffff811a120d>] alloc_pages_current+0x8d/0x120
[<ffffffff81154601>] __page_cache_alloc+0x91/0xc0
[<ffffffff81154b66>] pagecache_get_page+0x56/0x1e0
[<ffffffff81154ee6>] grab_cache_page_write_begin+0x26/0x40
[<ffffffff8124c811>] ext4_da_write_begin+0xa1/0x300
[<ffffffff8124d3a4>] ? ext4_da_write_end+0x124/0x2b0
[<ffffffff811543a0>] generic_perform_write+0xc0/0x1a0
[<ffffffff81156408>] __generic_file_write_iter+0x188/0x1e0
[<ffffffff812421d0>] ext4_file_write_iter+0xf0/0x340
[<ffffffff811c59fa>] __vfs_write+0xaa/0xe0
[<ffffffff811c6022>] vfs_write+0xa2/0x1a0
[<ffffffff811c6d26>] SyS_write+0x46/0xa0
[<ffffffff8165c02e>] entry_SYSCALL_64_fastpath+0x12/0x71

Some 11 hours later:
sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 error
4 in ld-2.17.so[7f03bb940000+21000]
sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 error 4
in ld-2.17.so[7f6b4a0d5000+21000]
systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp
00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 error
4 in ld-2.17.so[7ff8c8a82000+21000]
sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 error
4 in ld-2.17.so[7f183cf6b000+21000]
sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 error
4 in ld-2.17.so[7f3c665e2000+21000]

From isolated testing, this does not occur on kernel 4.5.x - however I
have not verified this myself.

The kernel config used can be found in the kernel-xen git repo if it
assists in debugging:
	http://xen.crc.id.au/git/?p=kernel-xen;a=summary

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

next             reply	other threads:[~2016-03-25  3:00 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-25  2:53 Steven Haigh [this message]
2016-03-25  2:53 ` 4.4: INFO: rcu_sched self-detected stall on CPU Steven Haigh
2016-03-25 12:23 ` [Xen-devel] " Boris Ostrovsky
2016-03-25 14:05   ` Steven Haigh
2016-03-25 14:44     ` Boris Ostrovsky
2016-03-25 14:44     ` [Xen-devel] " Boris Ostrovsky
2016-03-25 16:04       ` Steven Haigh
2016-03-25 16:04       ` [Xen-devel] " Steven Haigh
2016-03-25 16:20         ` Boris Ostrovsky
2016-03-25 16:20         ` [Xen-devel] " Boris Ostrovsky
2016-03-25 21:07           ` Steven Haigh
2016-03-29  8:56             ` Steven Haigh
2016-03-29 14:14               ` Boris Ostrovsky
2016-03-29 17:44                 ` Steven Haigh
2016-03-29 18:04                   ` Steven Haigh
2016-03-29 18:04                   ` Steven Haigh
2016-03-29 18:32                     ` Steven Haigh
2016-03-30 13:44                     ` Boris Ostrovsky
2016-03-30 13:44                     ` Boris Ostrovsky
2016-05-02 20:54                     ` gregkh
2016-05-03 15:11                       ` Steven Haigh
2016-05-03 15:46                         ` gregkh
2016-05-02 20:54                     ` gregkh
2016-03-29 17:44                 ` Steven Haigh
2016-04-02  1:50                 ` Steven Haigh
2016-04-02  1:50                 ` Steven Haigh
2016-03-29 14:14               ` Boris Ostrovsky
2016-03-29  8:56             ` Steven Haigh
2016-03-25 21:07           ` Steven Haigh
2016-03-25 14:05   ` Steven Haigh
2016-03-25 12:23 ` Boris Ostrovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56F4A816.3050505@crc.id.au \
    --to=netwiz@crc.id.au \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.