Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
To: Steven Haigh <netwiz@crc.id.au>,
	xen-devel <xen-devel@lists.xenproject.org>,
	linux-kernel@vger.kernel.org
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Subject: Re: [Xen-devel] 4.4: INFO: rcu_sched self-detected stall on CPU
Date: Fri, 25 Mar 2016 10:44:48 -0400	[thread overview]
Message-ID: <56F54EE0.6030004@oracle.com> (raw)
In-Reply-To: <56F545B1.8080609@crc.id.au>



On 03/25/2016 10:05 AM, Steven Haigh wrote:
> On 25/03/2016 11:23 PM, Boris Ostrovsky wrote:
>> On 03/24/2016 10:53 PM, Steven Haigh wrote:
>>> Hi all,
>>>
>>> Firstly, I've cross-posted this to xen-devel and the lkml - as this
>>> problem seems to only exist when using kernel 4.4 as a Xen DomU kernel.
>>> I have also CC'ed Greg KH for his awesome insight as maintainer.
>>>
>>> Please CC myself into replies - as I'm not a member of the kernel
>>> mailing list - I may miss replies from monitoring the archives.
>>>
>>> I've noticed recently that heavy disk IO is causing rcu_sched to detect
>>> stalls. The process mentioned usually goes to 100% CPU usage, and
>>> eventually processes start segfaulting and dying. The only fix to
>>> recover the system is to use 'xl destroy' to force-kill the VM and to
>>> start it again.
>>>
>>> The majority of these issues seem to mention ext4 in the trace. This may
>>> indicate an issue there - or may be a red herring.
>>>
>>> The gritty details:
>>> INFO: rcu_sched self-detected stall on CPU
>>> #0110-...: (20999 ticks this GP) idle=327/140000000000001/0
>>> softirq=1101493/1101493 fqs=6973
>>> #011 (t=21000 jiffies g=827095 c=827094 q=524)
>>> Task dump for CPU 0:
>>> rsync           R  running task        0  2446   2444 0x00000088
>>> ffffffff818d0c00 ffff88007fc03c58 ffffffff810a625f 0000000000000000
>>> ffffffff818d0c00 ffff88007fc03c70 ffffffff810a8699 0000000000000001
>>> ffff88007fc03ca0 ffffffff810d0e5a ffff88007fc170c0 ffffffff818d0c00
>>> Call Trace:
>>> <IRQ>  [<ffffffff810a625f>] sched_show_task+0xaf/0x110
>>> [<ffffffff810a8699>] dump_cpu_task+0x39/0x40
>>> [<ffffffff810d0e5a>] rcu_dump_cpu_stacks+0x8a/0xc0
>>> [<ffffffff810d4884>] rcu_check_callbacks+0x424/0x7a0
>>> [<ffffffff810a91e1>] ? account_system_time+0x81/0x110
>>> [<ffffffff810a9481>] ? account_process_tick+0x61/0x160
>>> [<ffffffff810e8050>] ? tick_sched_do_timer+0x30/0x30
>>> [<ffffffff810d9749>] update_process_times+0x39/0x60
>>> [<ffffffff810e7aa6>] tick_sched_handle.isra.15+0x36/0x50
>>> [<ffffffff810e808d>] tick_sched_timer+0x3d/0x70
>>> [<ffffffff810da342>] __hrtimer_run_queues+0xf2/0x250
>>> [<ffffffff810da698>] hrtimer_interrupt+0xa8/0x190
>>> [<ffffffff8100c61e>] xen_timer_interrupt+0x2e/0x140
>>> [<ffffffff810c8555>] handle_irq_event_percpu+0x55/0x1e0
>>> [<ffffffff810cbbca>] handle_percpu_irq+0x3a/0x50
>>> [<ffffffff810c7d22>] generic_handle_irq+0x22/0x30
>>> [<ffffffff813e50ff>] __evtchn_fifo_handle_events+0x15f/0x180
>>> [<ffffffff813e5130>] evtchn_fifo_handle_events+0x10/0x20
>>> [<ffffffff813e2233>] __xen_evtchn_do_upcall+0x43/0x80
>>> [<ffffffff813e3ea0>] xen_evtchn_do_upcall+0x30/0x50
>>> [<ffffffff8165deb2>] xen_hvm_callback_vector+0x82/0x90
>>> <EOI>  [<ffffffff810baf0d>] ? queued_write_lock_slowpath+0x3d/0x80
>>> [<ffffffff8165bcce>] _raw_write_lock+0x1e/0x30
>> This looks to me like ext4 failing to grab a lock. Everything above it
>> (in Xen code) is regular tick interrupt handling which detects the stall.
>>
>> Your config does not have CONFIG_PARAVIRT_SPINLOCKS so that eliminates
>> any possible issues with pv locks.
>>
>> Do you see anything "interesting" in dom0? (e.g. dmesg, xl dmesg,
>> /var/log/xen/) Are you oversubscribing your guest (CPU-wise)?
> There is nothing special being logged anywhere that I can see. dmesg /
> xl dmesg on the Dom0 show nothing unusual.
>
> I do share CPUs - but I don't give any DomU more than 2 vcpus. The
> physical host has 4 cores - 1 pinned to the Dom0.
>
> I log to a remote syslog on this system - and I've uploaded the entire
> log to a pastebin (don't want to do a 45Kb attachment here):
>      http://paste.fedoraproject.org/345095/58914452

That doesn't look like a full log. In any case, the RCU stall may be a 
secondary problem --- there is a bunch of splats before the stall.

-boris


>
> Not sure if it makes any difference at all, but my DomU config is:
> # cat /etc/xen/backup.vm
> name            = "backup.vm"
> memory          = 2048
> vcpus           = 2
> cpus            = "1-3"
> disk            = [ 'phy:/dev/vg_raid1_new/backup.vm,xvda,w' ]
> vif             = [ "mac=00:11:36:35:35:09, bridge=br203,
> vifname=vm.backup, script=vif-bridge" ]
> bootloader      = 'pygrub'
> pvh             = 1
>
> on_poweroff     = 'destroy'
> on_reboot       = 'restart'
> on_crash        = 'restart'
> cpu_weight      = 64
>
> I never had this problem when running kernel 4.1.x - it only started
> when I upgraded everything to 4.4 - not exactly a great help - but may
> help narrow things down?
>
>>> [<ffffffff81289b23>] ext4_es_remove_extent+0x43/0xc0
>>> [<ffffffff812692f9>] ext4_clear_inode+0x39/0x80
>>> [<ffffffff8124cafd>] ext4_evict_inode+0x8d/0x4e0
>>> [<ffffffff811df537>] evict+0xb7/0x180
>>> [<ffffffff811df636>] dispose_list+0x36/0x50
>>> [<ffffffff811e054b>] prune_icache_sb+0x4b/0x60
>>> [<ffffffff811c92c1>] super_cache_scan+0x141/0x190
>>> [<ffffffff811655fe>] shrink_slab.part.37+0x1ee/0x390
>>> [<ffffffff8116950c>] shrink_zone+0x26c/0x280
>>> [<ffffffff8116988c>] do_try_to_free_pages+0x15c/0x410
>>> [<ffffffff81169bfa>] try_to_free_pages+0xba/0x170
>>> [<ffffffff8115d1e5>] __alloc_pages_nodemask+0x525/0xa60
>>> [<ffffffff811a839c>] ? kmem_cache_free+0xcc/0x2c0
>>> [<ffffffff811a120d>] alloc_pages_current+0x8d/0x120
>>> [<ffffffff81154601>] __page_cache_alloc+0x91/0xc0
>>> [<ffffffff81154b66>] pagecache_get_page+0x56/0x1e0
>>> [<ffffffff81154ee6>] grab_cache_page_write_begin+0x26/0x40
>>> [<ffffffff8124c811>] ext4_da_write_begin+0xa1/0x300
>>> [<ffffffff8124d3a4>] ? ext4_da_write_end+0x124/0x2b0
>>> [<ffffffff811543a0>] generic_perform_write+0xc0/0x1a0
>>> [<ffffffff81156408>] __generic_file_write_iter+0x188/0x1e0
>>> [<ffffffff812421d0>] ext4_file_write_iter+0xf0/0x340
>>> [<ffffffff811c59fa>] __vfs_write+0xaa/0xe0
>>> [<ffffffff811c6022>] vfs_write+0xa2/0x1a0
>>> [<ffffffff811c6d26>] SyS_write+0x46/0xa0
>>> [<ffffffff8165c02e>] entry_SYSCALL_64_fastpath+0x12/0x71
>>>
>>> Some 11 hours later:
>>> sshd[785]: segfault at 1f0 ip 00007f03bb94ae5c sp 00007ffe9eb54470 error
>>> 4 in ld-2.17.so[7f03bb940000+21000]
>>> sh[787]: segfault at 1f0 ip 00007f6b4a0dfe5c sp 00007ffe3d4a71e0 error 4
>>> in ld-2.17.so[7f6b4a0d5000+21000]
>>> systemd-cgroups[788]: segfault at 1f0 ip 00007f4baa82ce5c sp
>>> 00007ffd28e4c4b0 error 4 in ld-2.17.so[7f4baa822000+21000]
>>> sshd[791]: segfault at 1f0 ip 00007ff8c8a8ce5c sp 00007ffede9e1c20 error
>>> 4 in ld-2.17.so[7ff8c8a82000+21000]
>>> sshd[792]: segfault at 1f0 ip 00007f183cf75e5c sp 00007ffc81ab7160 error
>>> 4 in ld-2.17.so[7f183cf6b000+21000]
>>> sshd[793]: segfault at 1f0 ip 00007f3c665ece5c sp 00007ffd9a13c850 error
>>> 4 in ld-2.17.so[7f3c665e2000+21000]
>>>
>>>   From isolated testing, this does not occur on kernel 4.5.x - however I
>>> have not verified this myself.
>>>
>>> The kernel config used can be found in the kernel-xen git repo if it
>>> assists in debugging:
>>>      http://xen.crc.id.au/git/?p=kernel-xen;a=summary
>>>
>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xen.org
>>> http://lists.xen.org/xen-devel

next prev parent reply	other threads:[~2016-03-25 14:45 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-25  2:53 4.4: INFO: rcu_sched self-detected stall on CPU Steven Haigh
2016-03-25  2:53 ` Steven Haigh
2016-03-25 12:23 ` [Xen-devel] " Boris Ostrovsky
2016-03-25 14:05   ` Steven Haigh
2016-03-25 14:05   ` [Xen-devel] " Steven Haigh
2016-03-25 14:44     ` Boris Ostrovsky [this message]
2016-03-25 16:04       ` Steven Haigh
2016-03-25 16:04       ` [Xen-devel] " Steven Haigh
2016-03-25 16:20         ` Boris Ostrovsky
2016-03-25 21:07           ` Steven Haigh
2016-03-25 21:07           ` Steven Haigh
2016-03-29  8:56             ` Steven Haigh
2016-03-29  8:56             ` Steven Haigh
2016-03-29 14:14               ` Boris Ostrovsky
2016-03-29 17:44                 ` Steven Haigh
2016-03-29 17:44                 ` Steven Haigh
2016-03-29 18:04                   ` Steven Haigh
2016-03-29 18:04                   ` Steven Haigh
2016-03-29 18:32                     ` Steven Haigh
2016-03-30 13:44                     ` Boris Ostrovsky
2016-03-30 13:44                     ` Boris Ostrovsky
2016-05-02 20:54                     ` gregkh
2016-05-03 15:11                       ` Steven Haigh
2016-05-03 15:46                         ` gregkh
2016-05-02 20:54                     ` gregkh
2016-04-02  1:50                 ` Steven Haigh
2016-04-02  1:50                 ` Steven Haigh
2016-03-29 14:14               ` Boris Ostrovsky
2016-03-25 16:20         ` Boris Ostrovsky
2016-03-25 14:44     ` Boris Ostrovsky
2016-03-25 12:23 ` Boris Ostrovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56F54EE0.6030004@oracle.com \
    --to=boris.ostrovsky@oracle.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netwiz@crc.id.au \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.