From: "Pasi Kärkkäinen" <pasik@iki.fi>
To: Matthias <matthias.kannenberg@googlemail.com>
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
Ian Campbell <Ian.Campbell@citrix.com>
Subject: Re: Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4
Date: Fri, 4 Oct 2013 09:07:05 +0300
Message-ID: <20131004060705.GY2924@reaktio.net>
In-Reply-To: <CABoYbGqNk8SwNq1OpPTnVfoSDCEVST+9k4_2CgT9xT9k5LBogw@mail.gmail.com>
On Fri, Oct 04, 2013 at 12:34:56AM +0200, Matthias wrote:
> Hi Konrad,
>
> sorry I missed your entry, google mail might not be the best software to
> view mailing lists ;)
>
> The RCU stall happens roughly 2 minutes after the machine is fully booted,
> and I'm usually working via SSH by then..
>
> I basically have two cases where the stall happens:
>
> 1) Without the no-cpuidle function, it happens when I start xencommons.
> 2) With or without no-cpuidle, it happens sporadically and seemingly at
> random. I have the feeling that logging in via SSH (or network traffic in
> general?) increases the chance of the RCU stall, and (this is only a guess)
> in most cases it happens when I enter a command of more than 16
> chars at the SSH command prompt. (I don't really think that this is what is
> causing the issue; I just noticed that when entering the usual commands to
> start all the Xen stuff / boot the domUs, it stalls mostly on the same
> commands, and SSH freezes at the same point in the command.) But
> more SSH-intensive commands like 'dmesg' or 'htop' don't cause it.
>
> Also, I can't really say what is on the screen because my dom0 does not
> have a vga card / both vga cards in the server are passed to different
> domUs and when I don't hide the vga cards on boot via xen-pciback.hide,
> the rcu usually does not stall and everything is fine..
>
For debugging you should have a serial console, so maybe get a PCI serial card
if you don't have a management processor offering SOL (Serial over LAN)?
-- Pasi
> 2013/9/27 Konrad Rzeszutek Wilk <[1]konrad.wilk@oracle.com>
>
> On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote:
> > Hi Konrad,
> >
> > good call! I was able to reproduce the error with the 3.12-rc2 kernel and
> > got a lot of information from the new NMI traces (log attached), but since
> > I'm not a Xen hacker I don't really know how to continue from here. So I
> > might add this to the original post and maybe someone can help me. After
> > all, the error has persisted for half a year now, and besides 2 kernel
> > version / .config combinations (a 3.8.2 and a 3.6.something) I could never
> > trace this issue back (even with bisecting the .config, because at some
> > point it seemed random).
>
> Can you tell me a bit about how this happens? Is it happening after you
> boot the machine? Does it happen after a specific workload?
>
> It looks like something in the RCU is taking far too long and
> the RCU callback mechanism starts complaining. CPU0 is where the
> RCU mechanism detects that something is off and starts sending NMIs to
> all CPUs. CPU2 is the only one that looks to be doing an RCU callback:
>
> NMI backtrace for cpu 1
> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2
> Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 3029 10/09/2012
> task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000
> RIP: e030:[<ffffffff8125b2b2>] [<ffffffff8125b2b2>] cfb_imageblit+0x1b3/0x411
> RSP: e02b:ffff88007de439f0 EFLAGS: 00000046
> RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003
> RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000
> RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0
> R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d
> R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000
> FS: 00007fb294ab4900(0000) GS:ffff88007de40000(0000) knlGS:0000000000000000
> CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
> CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660
> Stack:
> 0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa
> ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800
> 0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b
> Call Trace:
> <IRQ> [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d
> [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8
> [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d
> [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc
> [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290
> [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc
> [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306
> [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb
> [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8
> [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d
> [<ffffffff813db9c8>] ? printk+0x4f/0x51
> [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598 <==================
> [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239
> [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
> [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
> [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
> [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
> [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
> [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
> [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
> [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d
> [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa
> [<ffffffff8103df22>] ? check_events+0x12/0x20
> [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
> [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52
> [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
> [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
> [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
> [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
> [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
> [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
> <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
> [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
> [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
> Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29
> c1 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89
> c5 41 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3
>
> Which looks to be printing something on the VT console (which is running
> in KMS mode, as it uses framebuffer calls). So is there something on the
> screen scrolling wildly in a loop?
>
> But then there are also complaints about
>
> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.115 msecs
>
> this taking too long. I am wondering if there is some timing issue
> on your box.
>
> What version of Xen do you have?
> >
> >
> > 2013/9/27 Konrad Rzeszutek Wilk <[2]konrad.wilk@oracle.com>
> >
> > > On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
> > > > I'm currently on a vanilla 3.8.2 kernel because this is the only >3.4
> > > > kernel I found which doesn't give me this issue:
> > > > [3]http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
> > >
> > > So v3.12 (or rather the latest and greatest from Linus) has the mechanism
> > > for the NMI - so you can actually see what is causing the stall.
> > >
>
> _______________________________________________
> Xen-devel mailing list
> [4]Xen-devel@lists.xen.org
> [5]http://lists.xen.org/xen-devel
>
> References
>
> Visible links
> 1. mailto:konrad.wilk@oracle.com
> 2. mailto:konrad.wilk@oracle.com
> 3. http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
> 4. mailto:Xen-devel@lists.xen.org
> 5. http://lists.xen.org/xen-devel