Re: Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

From: "Pasi Kärkkäinen" <pasik@iki.fi>
To: Matthias <matthias.kannenberg@googlemail.com>
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	Ian Campbell <Ian.Campbell@citrix.com>
Subject: Re: Is: RCU callback detects an RCU hang with Linux 3.12+ Was: Re: Status of FLR in Xen 4.4
Date: Fri, 4 Oct 2013 09:07:05 +0300	[thread overview]
Message-ID: <20131004060705.GY2924@reaktio.net> (raw)
In-Reply-To: <CABoYbGqNk8SwNq1OpPTnVfoSDCEVST+9k4_2CgT9xT9k5LBogw@mail.gmail.com>

On Fri, Oct 04, 2013 at 12:34:56AM +0200, Matthias wrote:
>    Hi Konrad,
> 
>    sorry I missed your entry, google mail might not be the best software to
>    view mailing lists ;)
> 
>    The RCU stall happens roughly 2 minutes after the machine is fully booted,
>    and I'm usually working via SSH by then..
> 
>    I basically have two cases where the stall happens:
> 
>    1) Without the no-cpuidle function, It happens when I start xencommons
>    2) With or without no-cpuidle, this happens sometimes and arbitrary and I
>    have the feeling that logging in via SSH (or network traffic in general?)
>    will increase the chance of the rcu stall and (and this is only a guess)
>    in most cases this actually happens when I enter a command of more then 16
>    chars in the ssh command prompt. (I don't really think that this is really
>    causing the issue, I just noticed that when entering the usual commands to
>    start all the xen stuff / boot the domUs, it stalls mostly on the same
>    commands / when ssh freezes I came to the same part of the command). But
>    more ssh-intensive commands like 'dmesg' or 'htop' don't cause it..
> 
>    Also, I can't really say what is on the screen because my dom0 does not
>    have a vga card / both vga cards in the server are passed to different
>    domUs and when I don't hide the vga cards on boot via xen-pciback.hide,
>    the rcu usually does not stall and everything is fine..
> 

For debugging you should have a serial console.. so maybe get a pci serial card,
if you don't have any management processors offering SOL ? 

-- Pasi

>    2013/9/27 Konrad Rzeszutek Wilk <[1]konrad.wilk@oracle.com>
> 
>      On Fri, Sep 27, 2013 at 07:07:33PM +0200, Matthias wrote:
>      > Hi Konrad,
>      >
>      > good call! I was able to reproduce the error with the 3.12-rc2 kernel,
>      got
>      > a lot of information with the new NMI traces (log attached), but since
>      I'm
>      > not a xen hacker I don't really know how to continue from here. So I
>      might
>      > add this to the original post and maybe someone can help me. After all
>      the
>      > error persists for half a year now and besides 2 kernel version /
>      .config
>      > Combinations (a 3.8.2 and a 3.6.something) I could never trace this
>      issue
>      > back (even with bisecting the .config because at some point it seemed
>      > random).
> 
>      Can you tell me a bit on how this happens? Is it happening after you
>      boot the machine? Does it happen after a specific workload?
> 
>      It looks like something in the RCU is taking far too long and
>      the RCU callback mechanism starts complaining. The CPU0 is when the
>      RCU mechanism detects that something is off and starts sending NMI to
>      all CPUs. CPU2 is the only one that looks to be doing RCU callback:
> 
>      NMI backtrace for cpu 1
>      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.12.0-rc2 #2
>      Hardware name: System manufacturer System Product Name/Crosshair IV
>      Formula, BIOS 3029    10/09/2012
>      task: ffff8800658da080 ti: ffff880065900000 task.ti: ffff880065900000
>      RIP: e030:[<ffffffff8125b2b2>]  [<ffffffff8125b2b2>]
>      cfb_imageblit+0x1b3/0x411
>      RSP: e02b:ffff88007de439f0  EFLAGS: 00000046
>      RAX: 0000000000000000 RBX: ffff88001e1c2800 RCX: 0000000000000003
>      RDX: 000000000000003b RSI: ffff88001e00614e RDI: 0000000000000000
>      RBP: 0000000000000013 R08: 0000000000000001 R09: ffffffff814655f0
>      R10: ffff88001e006116 R11: ffffc90014875710 R12: 000000000000000d
>      R13: 0000000000000000 R14: ffffc90014875714 R15: ffffc90014875000
>      FS:  00007fb294ab4900(0000) GS:ffff88007de40000(0000)
>      knlGS:0000000000000000
>      CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
>      CR2: 00007fb29177a9a0 CR3: 000000000160c000 CR4: 0000000000000660
>      Stack:
>       0000000100aaaaaa 00000000000001d8 0000000000000000 0000000000aaaaaa
>       ffff8800532f0a40 ffff88001e1c2800 0000000000000001 ffff88001e1c2800
>       0000000000000000 ffff88007d424400 00000000ffff00ff 000000000000003b
>      Call Trace:
>       <IRQ>  [<ffffffff81256ac4>] ? bit_putcs+0x352/0x39d
>       [<ffffffff81219825>] ? paravirt_read_tsc+0x5/0x8
>       [<ffffffff81256772>] ? bit_cursor+0x45d/0x45d
>       [<ffffffff812523a8>] ? fbcon_putcs+0xbd/0xcc
>       [<ffffffff812bc6b6>] ? vt_console_print+0x234/0x290
>       [<ffffffff810b336f>] ? call_console_drivers.constprop.18+0xb3/0xfc
>       [<ffffffff810b3c7d>] ? console_unlock+0x131/0x306
>       [<ffffffff810b420e>] ? vprintk_emit+0x3bc/0x3eb
>       [<ffffffff812c92f5>] ? paravirt_read_tsc+0x5/0x8
>       [<ffffffff812cae43>] ? add_interrupt_randomness+0x3f/0x15d
>       [<ffffffff813db9c8>] ? printk+0x4f/0x51
>       [<ffffffff810e4433>] ? rcu_check_callbacks+0x195/0x598
>               <==================
>       [<ffffffff810a3b50>] ? irqtime_account_process_tick.isra.2+0xd6/0x239
>       [<ffffffff810c232a>] ? tick_sched_do_timer+0x2e/0x2e
>       [<ffffffff81084c35>] ? update_process_times+0x30/0x5b
>       [<ffffffff810c2237>] ? tick_sched_handle+0x3e/0x4a
>       [<ffffffff810c235a>] ? tick_sched_timer+0x30/0x4c
>       [<ffffffff81098355>] ? __run_hrtimer+0x93/0x159
>       [<ffffffff81098b72>] ? hrtimer_interrupt+0xe3/0x1ca
>       [<ffffffff8103d8e4>] ? xen_timer_interrupt+0x31/0x13b
>       [<ffffffff81294c4c>] ? HYPERVISOR_event_channel_op+0xd/0x1d
>       [<ffffffff8103d79b>] ? xen_force_evtchn_callback+0x9/0xa
>       [<ffffffff8103df22>] ? check_events+0x12/0x20
>       [<ffffffff810b5b7a>] ? handle_irq_event_percpu+0x4d/0x1c5
>       [<ffffffff813e556e>] ? notifier_call_chain+0x32/0x52
>       [<ffffffff810b8287>] ? handle_percpu_irq+0x39/0x4c
>       [<ffffffff812951c0>] ? __xen_evtchn_do_upcall+0x107/0x2cb
>       [<ffffffff81219936>] ? delay_tsc+0x9c/0xc6
>       [<ffffffff81093ba0>] ? __rcu_read_unlock+0x33/0x51
>       [<ffffffff8129663a>] ? xen_evtchn_do_upcall+0x22/0x32
>       [<ffffffff813e897e>] ? xen_do_hypervisor_callback+0x1e/0x30
>       <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>       [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>       [<ffffffff8103d768>] ? xen_safe_halt+0xc/0x13
>       [<ffffffff8104ae0b>] ? default_idle+0x14/0x3e
>       [<ffffffff810b53ee>] ? cpu_startup_entry+0x107/0x160
>      Code: fb 4c 89 d6 b9 08 00 00 00 ff cd 83 fd ff 74 32 44 0f be 2e 44 29
>      c1 8b 44 24 18 4d 8d 73 04 41 d3 fd 44 23 6c 24 04 43 23 04 a9 <41> 89
>      c5 41 31 fd 45 89 2b 85 c9 75 05 48 ff c6 b1 08 4d 89 f3
> 
>      Which looks to be printing something on the VT console (which is running
>      in KMS mode as it uses framebuffer calls). So is there something on the
>      screen scrolling widly in a loop?
> 
>      But then there are also complains about
> 
>      INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long
>      to run: 1.115 msecs
> 
>      this taking too long. I am wondering if there is some time issue
>      on your box.
> 
>      What version of Xen do you have?
>      >
>      >
>      > 2013/9/27 Konrad Rzeszutek Wilk <[2]konrad.wilk@oracle.com>
>      >
>      > > On Thu, Sep 26, 2013 at 07:59:40PM +0200, Matthias wrote:
>      > > > I'm currently on a vanilla 3.8.2 kernel because this is the only
>      >3.4
>      > > > kernel I found which doesn't give me this issue:
>      > > >
>      [3]http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
>      > >
>      > > So v3.12 (or rather the latest and greaters of the Linus) has the
>      mechanism
>      > > for the NMI - so you can actually see what is causing the stall.
>      > >
> 
>      _______________________________________________
>      Xen-devel mailing list
>      [4]Xen-devel@lists.xen.org
>      [5]http://lists.xen.org/xen-devel
> 
> References
> 
>    Visible links
>    1. mailto:konrad.wilk@oracle.com
>    2. mailto:konrad.wilk@oracle.com
>    3. http://lists.xen.org/archives/html/xen-users/2013-02/msg00114.html
>    4. mailto:Xen-devel@lists.xen.org
>    5. http://lists.xen.org/xen-devel

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

next prev parent reply	other threads:[~2013-10-04  6:07 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-09-26 16:05 Status of FLR in Xen 4.4 Matthias
2013-09-26 16:16 ` Ian Campbell
2013-09-26 17:59   ` Matthias
2013-09-27 13:34     ` Konrad Rzeszutek Wilk
2013-09-27 17:07       ` Matthias
2013-09-27 17:28         ` Sander Eikelenboom
2013-09-27 19:19           ` Matthias
2013-09-27 19:33             ` Sander Eikelenboom
2013-09-27 19:48               ` Matthias
2013-09-27 20:06                 ` Sander Eikelenboom
2013-09-27 17:53         ` Is: RCU callback detects an RCU hang with Linux 3.12+ Was: " Konrad Rzeszutek Wilk
2013-10-03 22:34           ` Matthias
2013-10-04  6:07             ` Pasi Kärkkäinen [this message]
2013-09-26 16:20 ` David Vrabel
2013-09-26 17:48   ` Ross Philipson
2013-09-26 18:01     ` David Vrabel
2013-09-26 18:41       ` Matthias
2013-09-26 19:13         ` Gordan Bobic
2013-09-27 12:26           ` Matthias
2013-09-27 13:27             ` Gordan Bobic
2013-09-27 13:48               ` Konrad Rzeszutek Wilk
2013-09-27 14:00                 ` Gordan Bobic
2013-10-03 22:20       ` Matthias

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131004060705.GY2924@reaktio.net \
    --to=pasik@iki.fi \
    --cc=Ian.Campbell@citrix.com \
    --cc=matthias.kannenberg@googlemail.com \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).