Re: [rcu_sched stall] regression/miss-config ?

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: linux-kernel@vger.kernel.org, Sasha Levin <sasha.levin@oracle.com>
Subject: Re: [rcu_sched stall] regression/miss-config ?
Date: Mon, 16 May 2016 17:58:20 -0700	[thread overview]
Message-ID: <20160517005820.GI3528@linux.vnet.ibm.com> (raw)
In-Reply-To: <67eb4bf6-c3d2-b9af-30ff-713a6d75e773@oracle.com>

On Mon, May 16, 2016 at 12:49:41PM -0700, Santosh Shilimkar wrote:
> On 5/16/2016 10:34 AM, Paul E. McKenney wrote:
> >On Mon, May 16, 2016 at 09:33:57AM -0700, Santosh Shilimkar wrote:
> >>On 5/16/2016 5:03 AM, Paul E. McKenney wrote:
> >>>On Sun, May 15, 2016 at 09:35:40PM -0700, santosh.shilimkar@oracle.com wrote:
> >>>>On 5/15/16 2:18 PM, Santosh Shilimkar wrote:
> >>>>>Hi Paul,
> >>>>>
> >>>>>I was asking Sasha about [1] since other folks in Oracle
> >>>>>also stumbled upon similar RCU stalls with v4.1 kernel in
> >>>>>different workloads. I was reported similar issue with
> >>>>>RDS as well and looking at [1], [2], [3] and [4], thought
> >>>>>of reaching out to see if you can help us to understand
> >>>>>this issue better.
> >>>>>
> >>>>>Have also included RCU specific config used in these
> >>>>>test(s). Its very hard to reproduce the issue but one of
> >>>>>the data point is, it reproduces on systems with larger
> >>>>>CPUs(64+). Same workload with less than 64 CPUs, don't
> >>>>>show the issue. Someone also told me, making use of
> >>>>>SLAB instead SLUB allocator makes difference but I
> >>>>>haven't verified that part for RDS.
> >>>>>
> >>>>>Let me know your thoughts. Thanks in advance !!
> >>>>>
> >>>>One of my colleague told me the pastebin server I used
> >>>>is Oracle internal only so adding the relevant logs along
> >>>>with email.
> >>>>
> >>
> >>[...]
> >>
> >>>>>[1] https://lkml.org/lkml/2014/12/14/304
> >>>>
> >>>>
> >>>>[2]  Log 1 snippet:
> >>>>-----------------------------------------------------------------
> >>>>INFO: rcu_sched self-detected stall on CPU
> >>>>INFO: rcu_sched self-detected stall on CPU { 54}  (t=60000 jiffies
> >>>>g=66023 c=66022 q=0)
> >>>>Task dump for CPU 54:
> >>>>ksoftirqd/54    R  running task        0   389      2 0x00000008
> >>>> 0000000000000007 ffff88ff7f403d38 ffffffff810a8621 0000000000000036
> >>>> ffffffff81ab6540 ffff88ff7f403d58 ffffffff810a86cf 0000000000000086
> >>>> ffffffff81ab6940 ffff88ff7f403d88 ffffffff810e3ad3 ffffffff81ab6540
> >>>>Call Trace:
> >>>> <IRQ>  [<ffffffff810a8621>] sched_show_task+0xb1/0x120
> >>>> [<ffffffff810a86cf>] dump_cpu_task+0x3f/0x50
> >>>> [<ffffffff810e3ad3>] rcu_dump_cpu_stacks+0x83/0xc0
> >>>> [<ffffffff810e490c>] print_cpu_stall+0xfc/0x170
> >>>> [<ffffffff810e5eeb>] __rcu_pending+0x2bb/0x2c0
> >>>> [<ffffffff810e5f8d>] rcu_check_callbacks+0x9d/0x170
> >>>> [<ffffffff810e9772>] update_process_times+0x42/0x70
> >>>> [<ffffffff810fb589>] tick_sched_handle+0x39/0x80
> >>>> [<ffffffff810fb824>] tick_sched_timer+0x44/0x80
> >>>> [<ffffffff810ebc04>] __run_hrtimer+0x74/0x1d0
> >>>> [<ffffffff810fb7e0>] ? tick_nohz_handler+0xa0/0xa0
> >>>> [<ffffffff810ebf92>] hrtimer_interrupt+0x102/0x240
> >>>> [<ffffffff810521f9>] local_apic_timer_interrupt+0x39/0x60
> >>>> [<ffffffff816c47b5>] smp_apic_timer_interrupt+0x45/0x59
> >>>> [<ffffffff816c263e>] apic_timer_interrupt+0x6e/0x80
> >>>> <EOI>  [<ffffffff8118db64>] ? free_one_page+0x164/0x380
> >>>> [<ffffffff8118de43>] ? __free_pages_ok+0xc3/0xe0
> >>>> [<ffffffff8118e775>] __free_pages+0x25/0x40
> >>>> [<ffffffffa21054f0>] rds_message_purge+0x60/0x150 [rds]
> >>>> [<ffffffffa2105624>] rds_message_put+0x44/0x80 [rds]
> >>>> [<ffffffffa21535b4>] rds_ib_send_cqe_handler+0x134/0x2d0 [rds_rdma]
> >>>> [<ffffffff816c102b>] ? _raw_spin_unlock_irqrestore+0x1b/0x50
> >>>> [<ffffffffa18c0273>] ? mlx4_ib_poll_cq+0xb3/0x2a0 [mlx4_ib]
> >>>> [<ffffffffa214c6f1>] poll_cq+0xa1/0xe0 [rds_rdma]
> >>>> [<ffffffffa214d489>] rds_ib_tasklet_fn_send+0x79/0xf0 [rds_rdma]
> >>>
> >>>The most likely possibility is that there is a 60-second-long loop in
> >>>one of the above functions.  This is within bottom-half execution, so
> >>>unfortunately the usual trick of placing cond_resched_rcu_qs() within this
> >>>loop, but outside of any RCU read-side critical section does not work.
> >>>
> >>First of all thanks for explanation.
> >>
> >>There is no loop which can last for 60 seconds in above code since
> >>its just completion queue handler used to free up buffers much like
> >>NIC
> >>drivers bottom half(NAPI). Its done in tasklet context for latency
> >>reasons which RDS care most. Just to get your attention, the RCU
> >>stall is also seen with XEN code too. Log for it end of the email.
> >>
> >>Another important observation is, for RDS if we avoid higher
> >>order page(s) allocation, issue is not reproducible so far.
> >>In other words, for PAGE_SIZE(4K, get_order(bytes) ==0) allocations,
> >>the system continues to run without any issue, so the loop scenario
> >>is ruled out more or less.
> >>
> >>To be specific, with PAGE_SIZE allocations, alloc_pages()
> >>is just allocating a page and __free_page() is used
> >>instead of __free_pages() from below snippet.
> >>
> >>------------------
> >>if (bytes >= PAGE_SIZE)
> >>	page = alloc_pages(gfp, get_order(bytes));
> >>
> >>.....
> >>
> >>(rm->data.op_sg[i].length <= PAGE_SIZE) ?
> >>__free_page(sg_page(&rm->data.op_sg[i])) :
> >>__free_pages(sg_page(&rm->data.op_sg[i]),
> >>get_order(rm->data.op_sg[i].length));
> >>----------------------------
> >
> >This sounds like something to take up with the mm folks.
> >
> Sure. Will do once the link between two issues is established.

Fair enough!

> >>>Therefore, if there really is a loop here, one fix would be to
> >>>periodically unwind back out to run_ksoftirqd(), but setting up so that
> >>>the work would be continued later.  Another fix might be to move this
> >>>from tasklet context to workqueue context, where cond_resched_rcu_qs()
> >>>can be used -- however, this looks a bit like networking code, which
> >>>does not always take kindly to being run in process context (though
> >>>careful use of local_bh_disable() and local_bh_enable() can sometimes
> >>>overcome this issue).  A third fix, which works only if this code does
> >>>not use RCU and does not invoke any code that does use RCU, is to tell
> >>>RCU that it should ignore this code (which will require a little work
> >>>on RCU, as it currently does not tolerate this sort of thing aside from
> >>>the idle threads).  In this last approach, event-tracing calls must use
> >>>the _nonidle suffix.
> >>>
> >>>I am not familiar with the RDS code, so I cannot be more specific.
> >>
> >>No worries. Since we saw the issue with XEN too, I was suspecting
> >>that somehow we didn't have RCU_TREE config setup correctly or
> >>some important RCU patch(s) missing in v4.1 which made it in
> >>later kernels.
> >>
> >>The only common denominator I saw between these two different
> >>usecases (RDS and XEN), was the 'hrtimer_interrupt()' chain
> >>which I believe triggers the rcu_sched() chain.
> >
> >Exactly!
> >
> >>I was thinking of enabling "CONFIG_RCU_TRACE" to see if we can
> >>get more information out of the system. Do you suggest any other
> >>RCU option(s)/patch(s) which we can help us to capture more
> >>information to understand the issue better. I wasn't sure about
> >>options like "RCU_KTHREAD_PRIO", TASKS_RCU etc. Used RCU config
> >>is also mentioned end of the email.
> >
> >Hmmm...  I just now noticed the "All QSes seen" message below.
> >That can happen if the load is quite high, and could as you say
> >one thing to try would be to set CONFIG_RCU_KTHREAD_PRIO=1.
> >This is not free, as it will mean more context switches involving
> >the RCU grace-period kthreads.
> >
> >Another thing to try is to set CONFIG_RCU_CPU_STALL_TIMEOUT to something
> >smaller than 60, say, 21.  This will cause the stall warnings to leave
> >less time before splatting.
> >
> OK will try with CONFIG_RCU_KTHREAD_PRIO=1 &
> CONFIG_RCU_CPU_STALL_TIMEOUT =21 . Indeed the usecases generate very
> high load on the systems.
> 
> >Are you running CONFIG_NO_HZ_FULL=y?  If so, the problem might be that
> >you need more housekeeping CPUs than you currently have configured.
> >
> Yes, CONFIG_NO_HZ_FULL=y. Do you mean "CONFIG_NO_HZ_FULL_ALL=y" for
> book keeping. Seems like without that clock-event code will just use
> CPU0 for things like broadcasting which might become bottleneck.
> This could explain connect the hrtimer_interrupt() path getting slowed
> down because of book keeping bottleneck.
> 
> $cat .config | grep NO_HZ
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_NO_HZ_IDLE is not set
> CONFIG_NO_HZ_FULL=y
> # CONFIG_NO_HZ_FULL_ALL is not set
> # CONFIG_NO_HZ_FULL_SYSIDLE is not set
> CONFIG_NO_HZ=y
> # CONFIG_RCU_FAST_NO_HZ is not set

Yes, CONFIG_NO_HZ_FULL_ALL=y would give you only one CPU for all
housekeeping tasks, including the RCU grace-period kthreads.  So you are
booting without any nohz_full boot parameter?  You can end up with the
same problem with CONFIG_NO_HZ_FULL=y and the nohz_full boot parameter
that you can with CONFIG_NO_HZ_FULL_ALL=y.

							Thanx, Paul

next prev parent reply	other threads:[~2016-05-17  0:58 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-15 21:18 [rcu_sched stall] regression/miss-config ? Santosh Shilimkar
2016-05-16  4:35 ` santosh.shilimkar
2016-05-16 12:03   ` Paul E. McKenney
2016-05-16 16:33     ` Santosh Shilimkar
2016-05-16 17:34       ` Paul E. McKenney
2016-05-16 19:49         ` Santosh Shilimkar
2016-05-17  0:58           ` Paul E. McKenney [this message]
2016-05-17 13:46             ` santosh.shilimkar
2016-05-17 19:15               ` Paul E. McKenney
2016-05-19 23:45                 ` Santosh Shilimkar
2016-05-20  4:05                   ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160517005820.GI3528@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=santosh.shilimkar@oracle.com \
    --cc=sasha.levin@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox